Running Local LLMs: Ollama vs. vLLM

Running Large Language Models (LLMs) locally on your own hardware has transitioned from a niche hobby to an essential workflow for developers and enterprises. The benefits are clear: complete data privacy, zero API costs, offline capability, and the freedom to experiment with specialized or fine-tuned models.

When it comes to local LLM engines, two frameworks stand out as industry standards: Ollama and vLLM. However, they serve very different purposes and architectures.

In this guide, we’ll compare Ollama and vLLM, outline their pros and cons, guide you through step-by-step installations for Windows, macOS, and Linux, and look at hardware optimization and model selection.

Ollama vs. vLLM: The Core Differences

To choose the right framework, it helps to understand their target use cases:

Ollama is designed for developers and desktop users. It acts as a friendly CLI and desktop application wrapper around llama.cpp. It simplifies downloading, managing, and running models, and runs efficiently on local consumer hardware (macOS, Windows, and Linux CPU/GPUs).
vLLM is designed for high-throughput production serving. It is a Python-based library optimized for server-grade GPUs (though it supports consumer GPUs too). Its signature feature is PagedAttention, which manages memory and keys/values caches dynamically, enabling massive throughput and concurrent requests.

Feature	Ollama	vLLM
Primary Focus	Simplicity, local dev environment	High throughput, production APIs
Underlying Engine	`llama.cpp` (C/C++)	Custom Python engine / PagedAttention
Target Hardware	Apple Silicon, Consumer GPUs, CPUs	NVIDIA GPUs, AMD GPUs
API Format	Custom endpoints & OpenAI-compatible	OpenAI-compatible server
Concurrently Served Requests	Low (queue-based or simple parallel)	High (dynamic batching)
Supported File Formats	GGUF (quantized)	Unquantized, AWQ, GPTQ, FP8/FP16

Pros and Cons

Ollama

Pros:
- Dead Simple: Single-command downloads and startup.
- Multi-Platform: Native installers for macOS, Windows, and Linux.
- Low Memory Footprint: Uses GGUF formats, allowing large models to run on modest VRAM or system RAM (via CPU offloading).
- Agent Integration: Native bindings for tools like LangChain, LlamaIndex, and terminal assistants.
Cons:
- Not Built for Concurrent Scale: Lacks advanced dynamic batching; struggles under high parallel loads.
- Less Flexible for Custom Models: Importing non-GGUF custom models requires writing a Modelfile and manual compilation.

vLLM

Pros:
- Incredible Speed: Up to 10x-20x higher throughput than standard engines under concurrent request loads.
- PagedAttention: Efficient memory management prevents out-of-memory (OOM) errors during long generation sequences.
- Distributed Serving: Native support for Tensor Parallelism (splitting models across multiple GPUs).
- OpenAI Compatibility: Seamless drop-in replacement for OpenAI API clients.
Cons:
- Complex Setup: Requires Python environments, CUDA configurations, and pip dependencies.
- Hardware Demanding: Requires a dedicated GPU with decent VRAM; CPU execution is highly limited.
- No CPU Offloading: Unlike llama.cpp, it cannot split layers between VRAM and system RAM.

Step-by-Step Installation Guides

Choose the framework that fits your workflow. Here are the copy-pasteable setup instructions:

1. Ollama Installation

macOS

Download the native zip archive, unzip, and drag it to your Applications folder:

Download Ollama for Mac

Or install via Homebrew:

brew install ollama

Windows

Download and run the official Windows Installer:

Download Ollama for Windows

Linux

Run the official one-liner script:

curl -fsSL https://ollama.com/install.sh | sh

Running a Model in Ollama

Once installed, open your terminal/command prompt and run:

# Run a coding model
ollama run qwen2.5-coder

# Or run a general assistant model
ollama run llama3.1

Ollama will download the model weights automatically and drop you into an interactive chat interface. You can access the API locally at http://localhost:11434.

2. vLLM Installation (Python & GPU required)

vLLM requires a Linux or Windows (via WSL2) environment with Python 3.9–3.12 and an NVIDIA GPU (compute capability 7.0+).

Step 1: Set up a virtual environment (Linux / WSL2)

# Update package list and install virtualenv
sudo apt update && sudo apt install -y python3-venv python3-pip

# Create and activate environment
python3 -m venv vllm-env
source vllm-env/activate

Step 2: Install PyTorch and vLLM

Make sure you have CUDA installed (version 12.1 is recommended):

# Install vLLM via pip
pip install --upgrade pip
pip install vllm

Step 3: Run the vLLM OpenAI-Compatible Server

Start the server hosting a popular open-weight model:

python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --port 8000

This serves the model at http://localhost:8000/v1 with OpenAI-compatible endpoints (/v1/chat/completions).

Performance Optimization on Local Hardware

To squeeze the best performance out of your workstation:

1. Leverage Quantization

Running raw FP16 models takes massive VRAM. Use compressed weights:

Ollama (GGUF): Ollama runs quantized models by default. Use 4-bit (Q4_K_M) for the best balance of speed and accuracy.

vLLM (AWQ/GPTQ): Install and host AWQ/GPTQ models in vLLM to cut memory requirements in half:

python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct-AWQ \
    --quantization awq \
    --port 8000

2. Configure GPU Memory Settings (vLLM)

By default, vLLM reserves 90% of your GPU memory for cache allocations. If you run into OOM errors or want to run other apps alongside vLLM, adjust the utilization:

# Restrict vLLM to use only 75% of VRAM
--gpu-memory-utilization 0.75

3. Restrict Context Length

Reduce memory allocations by setting a hard limit on context length:

# Limit the context window to 4096 tokens
--max-model-len 4096

Recommended Local Model Choices

Select your model based on your system hardware capacity:

For Coding & Development

Qwen 2.5 Coder (7B or 14B): Currently the state-of-the-art open-weight coding model. Outstanding at multi-file logic and debugging.
DeepSeek Coder (6.7B): Extremely lightweight, highly performant code generator.
Kimi K2.7-Code: Moonshot AI’s massive open-weight coding model. Offers outstanding coding capabilities, but requires hundreds of gigabytes of RAM/VRAM due to its parameter size, typically serving on specialized high-end workstations using vLLM or KTransformers.

For General Chat & Reasoning

Llama 3.1 (8B): Anthropic-grade general reasoning and instruction-following. Perfect for daily workspace agents.
Gemma 2 (9B): Google’s open-weight model, delivering high performance on reasoning benchmarks.
GLM-4 (9B) & GLM-4.7: Highly capable multi-lingual open-weight models from Zhipu AI, optimized for agentic reasoning and tool-calling.

For Ultra-Lightweight Systems (Laptops / No Dedicated GPU)

Llama 3.2 (3B) / Qwen 2.5 (1.5B): Small models that run incredibly fast on standard laptop CPUs. Perfect for simple lint checks and summaries.

Running GLM Models Locally

Can you run GLM models on Ollama and vLLM? Yes, both engines fully support the GLM family.

For Ollama: You can pull and run pre-compiled GGUF models directly:
```
ollama run glm4
```
For vLLM: You can host the model natively using the standard API server command:
```
vllm serve THUDM/glm-4-9b-chat
```

Summary

If you need a simple tool to act as a coding partner on your laptop or local workspace, Ollama is the ideal framework. If you are building a custom developer backend, serving multiple teammates, or running automated batch evaluation pipelines, vLLM provides the performance and scale you need.