
    How to Make LLMs Run Faster in Ollama on Your Windows 11 PC

    By techupdateadmin | August 12, 2025

    Ollama stats in Windows Terminal showing eval rate and the split of CPU and GPU being used.

    Recently, I’ve been playing around with Ollama as a way to run LLMs locally on my Windows 11 PC. Beyond educating myself, there are some good reasons to run local AI over relying on the likes of ChatGPT and Copilot.

    Education is the big thing, though, because I’m always curious and always looking to expand my knowledge of how new technology like this works, and how I can make it work for me.

    This past weekend, though, I certainly educated myself on one aspect: performance. Just last week, Ollama launched a new GUI application that makes it easier than ever to use LLMs, without needing third-party add-ons or diving into the terminal.



    The terminal is still a useful place to go, though, and it’s how it dawned on me that the LLMs I’ve been using on my machine haven’t been working as well as they could have been.

    It’s all about the context length.

    Context length is a key factor in performance

    Context length is a factor whether you use local AI or ChatGPT. (Image credit: Windows Central)

    So, what is context length, and how does it affect the performance of AI models? The context length (also known as the context window) defines the maximum amount of information a model can process in one interaction.

    Simply put, a longer context length allows the model to ‘remember’ more information from a conversation, process larger documents, and, in turn, improve overall accuracy through longer sessions.

    The trade-off is that a longer context length needs more horsepower and results in slower response times. Context is measured in tokens, which are essentially snippets of text, so the more text, the more resources are needed to process it.
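    As a very rough rule of thumb (the exact ratio depends on the model’s tokenizer), one token works out to around three-quarters of an English word. That means a 4k (4,096-token) window covers roughly 3,000 words of prompt, documents, and conversation history, an 8k window roughly 6,000 words, and the full 128k window somewhere in the region of 100,000 words.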

    Ollama can go up to a 128k context window, but if you don’t have the hardware to support it, your machine can slow to a crawl, as I found out while trying to get gpt-oss:20b to complete a test designed for 10 and 11-year-old children.

    In this example, a shorter context length didn’t give the model enough room to process the whole test properly, while a larger one did, to the detriment of performance.

    ChatGPT has a context length of over 100,000 tokens, but it has the benefit of OpenAI’s massive servers behind it.

    Keeping the right context length for the right tasks is crucial for performance

    Ollama’s new GUI app lets you easily change context length on the fly. (Image credit: Windows Central)

    It’s a bit of a pain to have to keep switching the context length, but doing so will make sure you’re getting a lot more from your models. This is something I hadn’t been keeping tabs on, but once the lightbulb lit up, I started having a significantly better time.

    It dawned on me while I was poking around in the terminal with some Ollama settings to see just how fast a model was running. I hadn’t realized that gpt-oss:20b was set to a 64k context length, and I was surprised to get an eval rate of just 9 tokens per second in response to a simple question.

    “How much wood would a woodchuck chuck if a woodchuck could chuck wood?”

    At this length, it also wasn’t using any of my GPU’s VRAM, seemingly refusing to load into memory, and was running entirely on the CPU and system RAM. After reducing the context length to 8k, the same question got a response at 43 tokens per second. Dropping to 4k doubled that to 86 tokens per second.

    The added benefit (with this model, at least) is that the GPU was fully utilized at 4k, and 93% utilized at 8k.
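    The likely reason the GPU drops out, as far as I can tell, is the KV cache. The model’s weights are a fixed size, but the memory reserved for attention keys and values grows roughly linearly with the context length (very roughly: 2 × layers × KV heads × head dimension × context length × bytes per value). Jump from 4k to 64k and that cache grows by around 16x, so the weights plus cache no longer fit in VRAM and Ollama offloads less of the model, or none of it, to the GPU.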

    The key thing here is that unless you’re actually dealing with large amounts of data or very long conversations, setting your context length lower will yield far better performance.

    It’s all about balancing what you’re trying to get from a session against a context length that can handle it with the best efficiency.

    How to change Ollama context length and save it for future use

    From the CLI, you can not only set the context length, but also save a version of that model for quicker access in the future. (Image credit: Windows Central)

    There are two ways you can change the context length in Ollama, one in the terminal and one in the new GUI application.

    The latter is the simpler option. You simply go into settings and move the slider between 4k and 128k. The downside is that these are fixed intervals; you can’t choose anything in between.

    But if you use the terminal, the Ollama CLI gives you more freedom to choose the exact context length you want, as well as the ability to save a version of the model with that context length attached.

    Why bother with this? Well, firstly, it’ll give you a model with that length permanently attached; you don’t need to keep changing it. So it’s convenient. But it also means you can have multiple ‘versions’ of a model with different context lengths for different use cases.

    To do this, open up a PowerShell window and run the model you want to use with this command:

    ollama run <model name>

    Once the model is running, and you’re inside its CLI environment, change the context length following this template. Here, I’m applying an 8k context length.

    /set parameter num_ctx 8192

    The number at the end is where you decide how long you want the context to be. The session will now use that context length, but if you want to save a version you can launch later, enter this command next:

    /save <new model name>

    After saving a version of gpt-oss:20b with an 8k context length, I can now launch it directly through Ollama whenever I want to use it. (Image credit: Windows Central)

    Now you can launch this saved model either through the CLI or the GUI app, or use it in any other integrations you have using Ollama on your PC.
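    Put together, the whole flow for the 8k version of gpt-oss:20b I mentioned above looks something like this (gpt-oss:20b-8k is just a name I made up; call your saved version whatever you like):

    ollama run gpt-oss:20b
    /set parameter num_ctx 8192
    /save gpt-oss:20b-8k
    /bye
    ollama run gpt-oss:20b-8k

    The /bye command simply exits the interactive session, and the saved copy will show up alongside the original when you run ollama list.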

    The downside is that the more versions you make, the more storage you use. But if you have the storage available, it’s a convenient way to avoid having to think about changing the context length at all.

    You can have one set low for best performance when you’re doing smaller tasks, one set higher for more intensive workloads, whatever you want, really.
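    If you’d rather script this than do it interactively, Ollama also lets you bake the parameter into a new model via a Modelfile and ollama create. A minimal sketch, again using my made-up gpt-oss:20b-8k name:

    FROM gpt-oss:20b
    PARAMETER num_ctx 8192

    Save those two lines as a file called Modelfile, then run ollama create gpt-oss:20b-8k -f Modelfile from the same folder, and the new version appears in your model list just like one saved from the CLI.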

    How to check performance of a model in Ollama

    Using the Ollama CLI, you can see how fast your model is running, as well as the split of CPU and GPU usage while it does so. (Image credit: Windows Central)

    I’ll close by showing how you can see the split of CPU and GPU usage for a model in Ollama, as well as how many tokens per second a model is spitting out.

    I recommend playing around with this to find your own sweet spot based on the model you’re using and the hardware at your disposal.

    For now, if you only have Ollama installed and no third-party tools, you’ll need to use the terminal. When running a model, add the --verbose flag to the command. For example:

    ollama run gemma3:12b --verbose

    This generates a report after each response detailing a number of performance metrics. The easiest one to look at is the final eval rate; the higher the tokens per second, the faster the performance.

    If you want to see how the model is split across CPU and GPU, you’ll need to back out of the session using the /bye command first. Then type ollama ps and you’ll get a breakdown of the CPU and GPU percentages. In an ideal world, you want the GPU percentage to be 100%, or as close to it as possible.
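    For reference, the whole check, end to end, looks something like this (the woodchuck question is just the throwaway prompt from earlier):

    ollama run gpt-oss:20b --verbose
    >>> How much wood would a woodchuck chuck if a woodchuck could chuck wood?
    (model response, followed by the timing stats, including the eval rate)
    >>> /bye
    ollama ps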

    As an example, I have to go below an 8k context length with gpt-oss:20b on my system with an RTX 5080 if I want to use 100% GPU. At 8k, it’s ‘only’ 93%, but that’s still perfectly fine.


    Hopefully, this helps some folks out there, especially newcomers to Ollama, since I was definitely leaving performance on the table. With a beefy GPU, I simply assumed all I had to do was run a model and it would magically just work.

    Even with an RTX 5080, though, I’ve needed to tweak my expectations and my context length to get the best performance. Generally, I’m not (yet) feeding these models massive documents, so I don’t need a massive context length, and I certainly don’t need one (and neither do you) for shorter queries.

    I’m happy to make mistakes so you don’t have to!
