❓ Quick FAQ
How are tok/s estimates calculated?
Token generation speed is memory-bandwidth bound: each decode step reads the active model weights plus the accumulated KV cache from GPU memory. The full formula is tok/s = bandwidth / (active_model_size + kv_cache). For short contexts, KV cache is small and the simplified form tok/s ≈ bandwidth / model_size applies. For MoE models, only the active expert weights are read per token, so speed uses the active parameter count. All estimates include a ±15–30% uncertainty band to reflect real-world variation from drivers, thermal throttling, and inference runtime.
Are these real benchmarks or heuristics?
Both, transparently. Model quality scores (MMLU-PRO, MATH, IFEval, etc.) come directly from the Open LLM Leaderboard; these are real, reproducible benchmark runs. Speed estimates are physics-based heuristics derived from GPU spec sheets (memory bandwidth, TFLOPS). They are not measured on-device, but the underlying formula is the same bandwidth-bound rule of thumb commonly used in the llama.cpp and wider LLM community. Want verified numbers? Submit your own results via the community benchmarks page.
Model Size Estimation
The memory footprint of a model depends on its parameter count and quantization level (bits per weight). For each model, we store pre-computed VRAM values derived from actual GGUF file measurements, which account for format overhead, embedding tables, and architecture-specific tensor layouts. When measured values are unavailable, we fall back to the formula size_MB = params_B × bits_per_weight / 8 × 1024, where params_B is the parameter count in billions.
Example: A 7B model at Q4_K_M (4.5 bpw) quantization: 7 × 4.5 / 8 × 1024 = 4,032 MB ≈ 3.9 GB. The actual GGUF file is typically 5–15% larger due to format and tensor overhead.
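The fallback calculation can be sketched in a few lines of Python. This is a minimal illustration that reproduces the arithmetic of the example above; the function name `estimate_model_size_mb` is hypothetical and not taken from the site's code.

```python
def estimate_model_size_mb(params_b: float, bits_per_weight: float) -> float:
    """Fallback weight footprint in MB.

    params_b: parameter count in billions (e.g. 7 for a 7B model).
    bits_per_weight: average bits per weight of the quantization (e.g. 4.5).
    """
    # billions of params x bits / 8 gives GB of weights; x 1024 converts to MB
    return params_b * bits_per_weight / 8 * 1024


size_mb = estimate_model_size_mb(7, 4.5)
print(f"{size_mb:.0f} MB = {size_mb / 1024:.2f} GB")
```

The result excludes the 5–15% format and tensor overhead mentioned above, so it should be read as a lower bound on the real file size.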
KV Cache
The KV (Key-Value) cache stores attention states for every token in the context window and grows linearly with context length. The theoretical formula for KV cache per token is: kv_bytes_per_token = 2 × n_layers × n_kv_heads × head_dim × bytes_per_element.
The factor of 2 accounts for both Key and Value tensors. For models using Grouped-Query Attention (GQA), n_kv_heads is much smaller than the number of attention heads (typically 8 regardless of model size), which keeps the KV cache sub-linear with parameter count.
Since our dataset does not include per-model architecture details (layer count, head counts, head dimension), we use an empirical power-law approximation calibrated against measured KV cache sizes: kv_cache_MB ≈ 128 × (params_B / 7)^0.4 × (context_tokens / 1024).
The base multiplier of 128 MB per 1K tokens corresponds to a 7B GQA model at FP16 (verified: 32 layers × 8 KV heads × 128 head_dim × 2 bytes × 2 tensors = 128 KB/token = 128 MB/1K). The 0.4 exponent reflects that layer count grows slower than parameter count: larger models widen their hidden dimensions rather than just stacking more layers, while GQA keeps KV heads fixed. This fits measured values to within 5% for 7B–70B models.
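Both the exact per-token formula and the power-law approximation can be checked numerically. The sketch below assumes the approximation has the form 128 MB × (params / 7B)^0.4 per 1K tokens, which is consistent with the base multiplier and exponent quoted above; the function names are hypothetical, and the 70B shape (80 layers, 8 KV heads, head_dim 128) is an illustrative Llama-70B-style assumption.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """Exact KV cache per token; the leading 2 covers the Key and Value tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def kv_cache_mb_approx(params_b: float, context_tokens: int,
                       base_mb_per_1k: float = 128.0,
                       exponent: float = 0.4) -> float:
    """Power-law estimate at FP16 (assumed form, anchored at a 7B GQA model)."""
    return base_mb_per_1k * (params_b / 7.0) ** exponent * (context_tokens / 1024)


# 7B GQA model: 32 layers, 8 KV heads, head_dim 128, FP16
print(kv_bytes_per_token(32, 8, 128) / 1024, "KB/token")

# Compare exact vs. approximate for an assumed 70B-style shape
exact_kb = kv_bytes_per_token(80, 8, 128) / 1024
approx_kb = kv_cache_mb_approx(70, 1024) * 1024 / 1024  # MB per 1K tokens == KB per token
print(f"70B: exact {exact_kb:.0f} KB/token, approx {approx_kb:.1f} KB/token")

# 70B at 128K context, in GB
print(f"{kv_cache_mb_approx(70, 128 * 1024) / 1024:.1f} GB")
```

Under these assumptions the approximation lands within about 1% of the exact value for the 70B shape, and the 128K figure comes out near the ~40 GB quoted in the next section.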
Why context length matters for VRAM
KV cache can become the dominant memory consumer at long context lengths. A 70B model at 128K context needs approximately 40 GB of additional KV cache alone, comparable to the model weights at Q4 quantization (~40 GB). This is why a model that fits in VRAM at 4K context may not fit at 32K or 128K, even when the base model size is well within the GPU's capacity.
At longer contexts, reading the KV cache during each decode step also impacts generation speed. The full decode step reads both the model weights and the accumulated KV cache from memory, so the effective formula becomes: tok/s = bandwidth / (model_size + kv_cache_size). Our speed estimates assume a short context window; expect slower generation at 32K+ contexts.
Mixture-of-Experts (MoE) Models
MoE models contain multiple “expert” sub-networks but only activate a fraction of them per token. This creates a fundamental distinction between VRAM requirements and inference speed:
VRAM: Total Parameters
All expert weights must be loaded into memory, since any expert may be activated at any time. A 671B-parameter MoE model needs VRAM for all 671B parameters.
Speed: Active Parameters
Only the active expert weights (plus shared attention layers) are read from memory per token. A 671B model with 37B active params achieves decode speed as if it were a ~37B dense model.
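Every MoE model in our database includes an active_params_b value, sourced from the model's official documentation or architecture paper where available; the remaining models use a 20% heuristic estimate. Speed estimates divide bandwidth by the active weights only: tok/s ≈ bandwidth / (active_params_b × bits_per_weight / 8).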
Example: DeepSeek R1 (671B total, 37B active) at Q4 on an H100 (3,350 GB/s): VRAM needed = 671 × 4.5 / 8 ≈ 377 GB (requires multi-GPU), but decode speed = 3350 / (37 × 4.5 / 8) ≈ 161 tok/s per shard (before multi-GPU overhead).
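The split between total and active parameters can be sketched as follows. The code only reproduces the arithmetic of the DeepSeek R1 example above; `weights_gb` is a hypothetical helper name.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight size in GB for a parameter count given in billions."""
    return params_b * bits_per_weight / 8


total_params_b, active_params_b = 671, 37   # figures from the example above
bandwidth_gb_s = 3350                       # H100 figure from the example above
bpw = 4.5

vram_gb = weights_gb(total_params_b, bpw)       # all experts must be resident
active_gb = weights_gb(active_params_b, bpw)    # only this much is read per token
decode_tok_s = bandwidth_gb_s / active_gb

print(f"VRAM: {vram_gb:.0f} GB, active weights: {active_gb:.1f} GB, "
      f"decode: {decode_tok_s:.0f} tok/s")
```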
Inference Modes
We determine the best way to run each model based on available memory:
GPU Full: Entire model fits in VRAM (with 10% headroom for KV cache growth and runtime overhead). Expect 30–100+ tokens/second depending on GPU bandwidth.
GPU + RAM Offload: Part of the model runs on GPU; the rest is offloaded to system RAM. Speed is limited by the serial pipeline through GPU and CPU layers. Requires at least 2 GB of free RAM beyond the offloaded portion. Expect 5–30 tokens/second.
CPU Only: Entire model runs on CPU using system RAM. Speed depends on RAM bandwidth and CPU capabilities (ISA extensions, core count, clock speed). Expect 1–30 tokens/second (higher with DDR5 and AVX-512/AMX).
Token Generation Speed
Token generation (decode) is memory-bandwidth bound. Each decode step reads the active model weights plus the accumulated KV cache from GPU memory. The full theoretical formula is: tok/s = bandwidth / (active_model_size + kv_cache_size).
For short contexts (≤4K tokens), the KV cache is small relative to the model weights, so the simplified form tok/s ≈ bandwidth / active_model_size is a close approximation. For MoE models, only the active expert weights are read per token, not the full model.
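Example: RTX 4090 (1,008 GB/s bandwidth) running a 70B Q4 dense model (~39 GB): 1008 / 39 ≈ 26 tokens/second (at short context). At 32K context, the KV cache adds ~10 GB, reducing speed to 1008 / 49 ≈ 21 tok/s. These figures are the bandwidth-bound ceiling: a ~39 GB model does not fit in a single RTX 4090's 24 GB of VRAM, so in practice this model would run in offload mode or across multiple GPUs.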
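The decode formula is simple enough to express directly. This is a minimal sketch using the numbers from the example above; `decode_tok_per_s` is a hypothetical name, and the result is a bandwidth-bound ceiling rather than a measured value.

```python
def decode_tok_per_s(bandwidth_gb_s: float, active_model_gb: float,
                     kv_cache_gb: float = 0.0) -> float:
    """Bandwidth-bound decode speed: every token reads the weights plus the KV cache."""
    return bandwidth_gb_s / (active_model_gb + kv_cache_gb)


short_ctx = decode_tok_per_s(1008, 39)        # KV cache ignored at short context
long_ctx = decode_tok_per_s(1008, 39, 10)     # ~10 GB of KV cache at 32K context
print(f"short context: {short_ctx:.1f} tok/s, 32K context: {long_ctx:.1f} tok/s")
```

For MoE models, `active_model_gb` would be the size of the active weights, not of the full model.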
Why bandwidth matters more than compute
During generation, each token requires reading the active model weights (and KV cache) from memory. Modern GPUs have far more compute power than needed; the bottleneck is how fast you can feed data to the cores. This is why the RTX 4090 and RTX 3090 have similar LLM performance despite the 4090 having much more compute: their memory bandwidth is comparable (1,008 GB/s vs. 936 GB/s).
GPU + RAM Offload Speed
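When a model doesn't fully fit in VRAM, some layers run on GPU and the rest on CPU/RAM. These layers process sequentially, not in parallel, so the total time per token is the sum of both parts, not an average: time_per_token = gpu_weights / gpu_bandwidth + cpu_weights / ram_bandwidth, and tok/s = 1 / time_per_token.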
CPU layers read weights directly from system RAM, so the bottleneck is RAM bandwidth (not PCIe). PCIe only carries the small activation vectors (~8–32 KB) between GPU and CPU layer groups, adding a small latency overhead (~0.1–0.2 ms per token). Faster PCIe generations (4.0, 5.0) reduce this latency.
Why offload is much slower than full GPU
Even with 80% of layers on GPU, the remaining 20% on CPU creates a serial bottleneck. System RAM typically provides 40–80 GB/s (DDR4/DDR5) of effective bandwidth, compared to hundreds of GB/s for GPU memory. The overall speed is dominated by the slowest stage in the pipeline.
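A small calculation makes the serial bottleneck concrete. The sketch below assumes uniform layer sizes, so that the GPU fraction of layers equals the GPU fraction of weights, and adds a fixed PCIe latency in the 0.1–0.2 ms range mentioned above. The function name and the 60 GB/s RAM figure are illustrative assumptions.

```python
def offload_tok_per_s(model_gb: float, gpu_fraction: float,
                      gpu_bw_gb_s: float, ram_bw_gb_s: float,
                      pcie_latency_s: float = 0.00015) -> float:
    """Decode speed when layers are split between GPU and CPU and run in series."""
    gpu_time = model_gb * gpu_fraction / gpu_bw_gb_s          # GPU layers read VRAM
    cpu_time = model_gb * (1 - gpu_fraction) / ram_bw_gb_s    # CPU layers read system RAM
    return 1.0 / (gpu_time + cpu_time + pcie_latency_s)       # times add up, they do not average


full_gpu = 1008 / 39                                   # ceiling if everything were in VRAM
offload = offload_tok_per_s(39, 0.80, 1008, 60)        # 80% of layers on GPU
print(f"full GPU: {full_gpu:.1f} tok/s, 80/20 offload: {offload:.1f} tok/s")
```

With these inputs the 20% of weights in system RAM accounts for roughly four fifths of the time per token, which is why the offload figure falls to about a quarter of the full-GPU ceiling.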
RAM Bandwidth
For CPU inference and GPU+RAM offload modes, system RAM bandwidth is a critical factor. We estimate it from the DDR specification and apply a real-world utilization factor: effective_GB/s = transfer_rate (MT/s) × 8 bytes × channels / 1000 × 0.75.
The 75% utilization factor accounts for memory controller overhead, refresh cycles, cache line alignment, NUMA effects, and contention with the OS and other processes. For example, DDR5-6400 in dual-channel configuration: 6400 × 8 × 2 / 1000 × 0.75 ≈ 77 GB/s effective.
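The same estimate as a short function, reproducing the DDR5-6400 example above. The name `effective_ram_bandwidth_gb_s` is hypothetical; the 8-byte bus width corresponds to the × 8 term in the example.

```python
def effective_ram_bandwidth_gb_s(mt_per_s: int, channels: int,
                                 bus_bytes: int = 8,
                                 utilization: float = 0.75) -> float:
    """Effective RAM bandwidth: transfers/s x bytes per transfer x channels x utilization."""
    peak_gb_s = mt_per_s * bus_bytes * channels / 1000   # MT/s x bytes -> MB/s -> GB/s
    return peak_gb_s * utilization


print(f"DDR5-6400 dual-channel: {effective_ram_bandwidth_gb_s(6400, 2):.1f} GB/s")
print(f"DDR4-3200 dual-channel: {effective_ram_bandwidth_gb_s(3200, 2):.1f} GB/s")
```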
Uncertainty Ranges
All performance estimates are shown as ranges rather than single numbers. Real-world performance varies due to inference runtime (Ollama, llama.cpp, vLLM), driver version, thermal throttling, background system load, and model-specific optimizations.
The ±15–30% bands reflect the inherent variance in real-world setups. For precise numbers, we recommend running your own benchmarks and submitting them to our community benchmarks page.
Prefill Speed & Time to First Token (TTFT)
Processing the input prompt (prefill) is compute-bound, not bandwidth-bound. The forward pass requires approximately 2 FLOPs per active parameter per token, so prefill tok/s ≈ (peak FLOPS × utilization) / (2 × active_params).
Utilization is typically 30% without tensor cores, 60% with tensor cores. For MoE models, the active parameter count is used since only the selected experts participate in each forward pass.
Time to First Token (TTFT) is derived from prefill speed: TTFT ≈ prompt_tokens / prefill tok/s.
The full theoretical TTFT also includes a memory-loading component (model_size / bandwidth), but since prefill is compute-bound for typical prompt lengths (100+ tokens), the memory component is overlapped with computation by GPU pipelining and does not add to total latency. For very short prompts (<30 tokens), memory loading may dominate; our estimates are optimistic in that regime.
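A minimal sketch of the prefill and TTFT estimates, assuming prefill speed is usable FLOPS divided by 2 FLOPs per active parameter and TTFT is prompt length divided by prefill speed. The function names and the 100 TFLOPS figure are hypothetical and do not describe any specific GPU.

```python
def prefill_tok_per_s(peak_tflops: float, active_params_b: float,
                      has_tensor_cores: bool = True) -> float:
    """Compute-bound prefill speed in tokens/second."""
    utilization = 0.60 if has_tensor_cores else 0.30     # utilization figures from the text
    flops_per_token = 2 * active_params_b * 1e9          # ~2 FLOPs per active parameter
    return peak_tflops * 1e12 * utilization / flops_per_token


def ttft_seconds(prompt_tokens: int, prefill_speed: float) -> float:
    """Time to first token, ignoring the memory-loading component."""
    return prompt_tokens / prefill_speed


speed = prefill_tok_per_s(peak_tflops=100, active_params_b=7)   # hypothetical 100 TFLOPS GPU
print(f"prefill: {speed:.0f} tok/s, TTFT for 1,000 tokens: {ttft_seconds(1000, speed):.2f} s")
```

As noted above, this is optimistic for very short prompts, where loading the weights can dominate.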
Multi-GPU Scaling
Multiple GPUs can combine their VRAM and bandwidth via tensor parallelism, but with overhead from inter-GPU communication:
With NVLink
High-speed direct GPU-to-GPU connection.
- Effective VRAM: 95% of total
- Effective bandwidth: 90% scaling per GPU
Without NVLink (PCIe)
Communication through system bus.
- Effective VRAM: 85% of total
- Effective bandwidth: ~30% bonus per additional GPU
The bandwidth formula for PCIe multi-GPU is: effective_bw = base_bw × (1 + (gpu_count − 1) × 0.3). This conservative estimate reflects the PCIe synchronization overhead during tensor-parallel inference. Apple Silicon devices do not support multi-GPU configurations (Ultra variants achieve high memory capacity through a single unified memory pool).
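The scaling rules can be sketched as one function. The PCIe branch follows the formula given above. The NVLink branch reads "90% scaling per GPU" as base_bw × gpu_count × 0.9, which is an assumption, as is leaving a single GPU unscaled. The function name is hypothetical.

```python
def multi_gpu_effective(vram_gb_per_gpu: float, bw_gb_s_per_gpu: float,
                        gpu_count: int, nvlink: bool) -> tuple[float, float]:
    """Return (effective VRAM in GB, effective bandwidth in GB/s)."""
    if gpu_count == 1:
        return vram_gb_per_gpu, bw_gb_s_per_gpu              # assumed: no penalty for one GPU
    total_vram = vram_gb_per_gpu * gpu_count
    if nvlink:
        # Assumed reading of "90% scaling per GPU"
        return total_vram * 0.95, bw_gb_s_per_gpu * gpu_count * 0.90
    # PCIe: ~30% bandwidth bonus per additional GPU
    return total_vram * 0.85, bw_gb_s_per_gpu * (1 + (gpu_count - 1) * 0.3)


for label, nv in (("PCIe", False), ("NVLink", True)):
    vram, bw = multi_gpu_effective(24, 1008, 2, nvlink=nv)
    print(f"2 GPUs over {label}: {vram:.1f} GB, {bw:.1f} GB/s")
```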
CPU Inference
CPU inference speed is primarily limited by system RAM bandwidth. The base speed is calculated the same way as GPU inference (bandwidth divided by active model size), then adjusted by CPU-specific multipliers for ISA extensions, core count, and clock speed.
Intel AMX (Advanced Matrix Extensions) provides significant acceleration for matrix operations, making newer Intel CPUs notably faster for LLM inference. CPU inference speed is capped at 60 tok/s to reflect practical limits.
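A minimal sketch of the CPU estimate with the 60 tok/s cap. The CPU-specific multipliers are not listed here, so `cpu_multiplier` is a hypothetical caller-supplied placeholder, and the function name is likewise an assumption.

```python
CPU_TOK_PER_S_CAP = 60.0   # practical cap stated in the text


def cpu_tok_per_s(ram_bw_gb_s: float, active_model_gb: float,
                  cpu_multiplier: float = 1.0) -> float:
    """Bandwidth-bound CPU decode speed, scaled by a CPU factor and capped."""
    base = ram_bw_gb_s / active_model_gb
    return min(base * cpu_multiplier, CPU_TOK_PER_S_CAP)


print(f"{cpu_tok_per_s(76.8, 3.9):.1f} tok/s")    # ~3.9 GB model on ~77 GB/s RAM
print(f"{cpu_tok_per_s(76.8, 0.5):.1f} tok/s")    # small model: the cap applies
```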
Hardware Recommendations
When recommending hardware for a specific model (“Find Hardware”), we evaluate every GPU in our database across single and multi-GPU configurations (1, 2, 4, and 8 GPUs). For each viable option:
1. VRAM check: The model's pre-computed VRAM value (from measured GGUF data) must fit within the GPU's total VRAM (with multi-GPU efficiency scaling applied).
2. Speed estimate: Decode tokens/sec is calculated from the GPU's memory bandwidth and the model's active parameter count (accounting for MoE architecture).
3. Tier classification: Options are categorized as Budget (<15 tok/s), Recommended (15–40 tok/s), or Premium (40+ tok/s) and filtered by user preferences (minimum speed, maximum budget).
The final recommendations include a minimum viable option (cheapest that can physically run the model), a best value option (highest tok/s per dollar), and a premium option (fastest within a reasonable price range). Cloud alternatives (RunPod, Vast.ai) are shown alongside local hardware for comparison. All GPU vendors are supported: NVIDIA, AMD, Intel Arc, and Apple Silicon.
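The tier thresholds from step 3 can be written as a small classifier. The text lists 40 tok/s under both Recommended and Premium; this sketch assumes exactly 40 counts as Premium. The function name is hypothetical.

```python
def classify_tier(tok_per_s: float) -> str:
    """Map an estimated decode speed to a recommendation tier."""
    if tok_per_s < 15:
        return "Budget"
    if tok_per_s < 40:          # assumed: the upper bound of Recommended is exclusive
        return "Recommended"
    return "Premium"


for speed in (8, 26, 161):
    print(f"{speed} tok/s -> {classify_tier(speed)}")
```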
Final Score Calculation
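Each model's final score combines three factors (benchmark quality, speed, and quantization quality) with weights that vary by use case: final_score = w_quality × quality + w_speed × speed + w_quant × quant.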
Benchmark scores are renormalized when some benchmarks are missing for a model, so models with partial coverage are not unfairly penalized. The best available quantization is selected per model: we prefer GPU Full inference over offload or CPU, and within the same inference mode, higher-quality quantization is preferred.
Weights by Use Case
| Use Case | Quality | Speed | Quant | Key Benchmarks |
|---|---|---|---|---|
| Chat | 45% | 30% | 25% | IFEval, MMLU-PRO, BBH, GPQA |
| Coding | 55% | 25% | 20% | HumanEval, BigCodeBench, MATH, IFEval |
| Reasoning | 60% | 15% | 25% | MATH, GPQA, BBH, MUSR |
| Creative | 40% | 35% | 25% | IFEval, MMLU-PRO, BBH |
| Vision | 50% | 25% | 25% | IFEval, MMLU-PRO, BBH, GPQA |
| Roleplay | 35% | 35% | 30% | IFEval, MMLU-PRO, BBH |
| Embedding | 70% | 10% | 20% | MMLU-PRO, BBH, IFEval |
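The weighting can be sketched as a lookup plus a weighted sum. The weights come from the table above. The 0–100 scale for the component scores, the equal-weight average used for renormalization, and the function names are all assumptions made for illustration.

```python
from typing import Dict, Optional

# (quality, speed, quant) weights per use case, from the table above
WEIGHTS = {
    "Chat": (0.45, 0.30, 0.25),
    "Coding": (0.55, 0.25, 0.20),
    "Reasoning": (0.60, 0.15, 0.25),
    "Creative": (0.40, 0.35, 0.25),
    "Vision": (0.50, 0.25, 0.25),
    "Roleplay": (0.35, 0.35, 0.30),
    "Embedding": (0.70, 0.10, 0.20),
}


def quality_score(benchmarks: Dict[str, Optional[float]]) -> float:
    """Average over the benchmarks that are present (assumed renormalization)."""
    present = [v for v in benchmarks.values() if v is not None]
    return sum(present) / len(present)


def final_score(use_case: str, quality: float, speed: float, quant: float) -> float:
    """Weighted sum of the three component scores (assumed to share a 0-100 scale)."""
    w_quality, w_speed, w_quant = WEIGHTS[use_case]
    return w_quality * quality + w_speed * speed + w_quant * quant


q = quality_score({"IFEval": 80.0, "MMLU-PRO": 70.0, "BBH": None, "GPQA": 90.0})
print(f"quality: {q:.1f}, Chat score: {final_score('Chat', q, 60.0, 100.0):.1f}")
```

Averaging only the available benchmarks means a model missing one benchmark is scored on the rest instead of being treated as scoring zero on it.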
Data Sources
- Model benchmarks: Open LLM Leaderboard on HuggingFace
- GPU specifications: Official NVIDIA, AMD, Apple, Intel spec sheets (206+ GPUs)
- Model VRAM: Pre-computed from actual GGUF file measurements, with formula fallback based on llama.cpp memory estimation
- MoE active parameters: Sourced from official model cards and architecture papers for 137 of 156 MoE models; remainder uses a 20% heuristic estimate
- CPU specifications: Intel, AMD, Apple Silicon spec sheets (78 CPUs)
Limitations
- KV cache formula uses a power-law GQA approximation (±5% for 7B–70B); models with non-standard KV head counts may differ
- Decode speed estimates assume short context (≤4K tokens); at longer contexts, KV cache reads reduce throughput
- We assume uniform layer sizes; real models may have varying layer dimensions
- The 10% VRAM headroom is conservative; some systems can use more
- CPU inference speed is capped at 60 tok/s; server-class CPUs with multi-channel DDR5 may exceed this
- Real performance varies by inference engine (llama.cpp vs vLLM vs Ollama vs SGLang)
- Benchmark scores reflect published evaluations; real-world quality may differ for specific tasks
- MoE active parameter counts are sourced from documentation for 88% of MoE models; 12% use a 20% heuristic estimate
- Multi-GPU estimates use conservative PCIe scaling factors; NVLink-connected systems may perform better than estimated