Optimizing Local LLM Selection: An In-Depth Look at whichllm
Selecting a local Large Language Model (LLM) often feels like a guessing game. Most users fall back on the "biggest model that fits" heuristic, assuming that if a model fits in VRAM, it is automatically the best choice. However, newer, smaller models frequently outperform older, larger ones, and the quantization level affects both quality and speed.
whichllm is a command-line tool designed to bridge this gap. Rather than relying on simple size heuristics, it auto-detects system hardware and ranks models based on real-world benchmarks, recency, and hardware compatibility. This approach transforms the process of local LLM deployment from trial-and-error to an evidence-based decision.
Beyond the "What Fits?" Mentality
The core philosophy of whichllm is that fitting a model into VRAM is the easy part; the hard part is determining which of the fitting models is actually the most capable. To achieve this, the tool implements several sophisticated ranking mechanisms:
Evidence-Based Ranking
Instead of a static list, whichllm aggregates data from multiple high-signal sources, including LiveBench, Artificial Analysis, Aider, Chatbot Arena Elo, and the Open LLM Leaderboard. To limit the effect of "benchmark gaming" and outdated data, the tool employs a recency-aware system that demotes stale leaderboard results along a model's lineage.
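One way to picture a recency-aware merge is a weighted average in which each leaderboard's influence decays with age. The sketch below is illustrative only: the `BenchmarkEntry` type, the exponential decay, the 180-day half-life, and the per-source weights are assumptions, not whichllm's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkEntry:
    source: str          # e.g. "LiveBench"
    score: float         # assumed to be normalized to 0-100 already
    age_days: int        # days since this leaderboard result was published
    weight: float = 1.0  # hypothetical per-source trust weight


def recency_weight(age_days: int, half_life_days: float = 180.0) -> float:
    """Exponential decay: influence halves every `half_life_days` (assumed value)."""
    return 0.5 ** (age_days / half_life_days)


def merge_scores(entries: List[BenchmarkEntry]) -> Optional[float]:
    """Weighted average of scores, with stale entries down-weighted."""
    total = 0.0
    weight_sum = 0.0
    for entry in entries:
        w = entry.weight * recency_weight(entry.age_days)
        total += w * entry.score
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else None
```

With a fresh score of 80 and a 180-day-old score of 40, the stale entry counts half as much, so the merged value lands closer to 80 than a plain average would.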
Confidence-Graded Scoring
Not all benchmark data is created equal. whichllm tags scores based on their origin:
- Direct: Exact model ID match (highest confidence).
- Variant: Suffix-stripped or instruct variants.
- Base: Inherited from the base model.
- Interpolated: Size-aware interpolation within a model family.
- Self-reported: Uploader-claimed evals (heavily discounted).
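The first three tags can be read as a fallback chain: try an exact match, then a suffix-stripped match, then the base model. The sketch below illustrates that idea under stated assumptions; the suffix list, the `base_of` mapping, and the function name are hypothetical and not taken from whichllm.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical suffixes that mark an instruct/chat variant of the same model.
VARIANT_SUFFIX = re.compile(r"-(instruct|chat|it)$", re.IGNORECASE)


def resolve_score(
    model_id: str,
    scores: Dict[str, float],
    base_of: Dict[str, str],
) -> Tuple[Optional[float], str]:
    """Return (score, confidence_tag) using a direct -> variant -> base fallback."""
    if model_id in scores:
        return scores[model_id], "direct"

    stripped = VARIANT_SUFFIX.sub("", model_id)
    if stripped != model_id and stripped in scores:
        return scores[stripped], "variant"

    base = base_of.get(model_id)
    if base is not None and base in scores:
        return scores[base], "base"

    return None, "none"
```

A caller would then discount the returned score according to its tag, so that an inherited or variant score never outranks a direct measurement of equal value.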
Architecture-Aware VRAM Estimation
Memory estimation goes beyond simple weight calculation. The tool calculates VRAM requirements by accounting for:
- Model Weights: Based on parameter count and quantization level.
- KV Cache: Sized with Grouped-Query Attention (GQA) taken into account, plus activation overhead.
- Framework Overhead: A baseline buffer (approx. 500 MB) for the inference engine.
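The three components above can be approximated with standard back-of-the-envelope formulas. This is a minimal sketch, not whichllm's estimator: the function names are hypothetical, the KV cache is assumed to be stored in 16-bit precision, and activation overhead is left out for brevity.

```python
GIB = 1024 ** 3


def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Bytes needed for the weights at a given average bits per weight."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,  # assumed fp16 cache
) -> int:
    """KV cache size; the factor 2 covers keys and values.

    With GQA, n_kv_heads is smaller than the number of query heads,
    which shrinks the cache proportionally.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def estimate_vram_gib(
    params_billion: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    overhead_mb: int = 500,  # baseline buffer mentioned in the text
) -> float:
    total = (
        weight_bytes(params_billion, bits_per_weight)
        + kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len)
        + overhead_mb * 1024 ** 2
    )
    return total / GIB


# Illustrative 8B model: 32 layers, 8 KV heads, head_dim 128, 8192-token context,
# at an assumed 4.5 bits per weight. The KV cache alone is exactly 1 GiB here.
print(round(estimate_vram_gib(8, 4.5, 32, 8, 128, 8192), 2))
```

The same model without GQA (32 KV heads instead of 8) would need four times the KV cache, which is why a size-only estimate can mislead at long context lengths.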
Key Features and Workflow
whichllm is designed to be scriptable and integrated into existing workflows. Its primary capabilities include:
- Hardware Simulation: Users can plan future purchases by simulating specific GPUs (e.g., `whichllm --gpu "RTX 4090"`).
- Reverse Lookup: The `plan` command allows users to determine what hardware is required for a specific model (e.g., `whichllm plan "llama 3 70b"`).
- Instant Execution: The `run` command creates an isolated environment via `uv`, downloads the best GGUF variant, and starts a chat session immediately.
- Developer Snippets: The `snippet` command generates ready-to-run Python code using `llama-cpp-python` or `transformers` for easy integration into applications.
Community Perspectives and Technical Critiques
While the tool has been well-received for its utility, the Hacker News community raised several critical technical points that highlight the complexities of local LLM orchestration.
The "Best" Dilemma
One of the most prominent critiques is the subjectivity of "best." As user @bityard noted, a model's suitability depends entirely on the workload:
The "best" model is not "whatever fits into VRAM." You can do lots of useful stuff with a small CPU-only model... The only way to know whether a model is suitable for your specific task is to try it out for yourself.
Hardware Edge Cases
Users pointed out gaps in hardware detection, particularly regarding unified memory architectures. For instance, @cyanydeez noted that on some Linux setups with AMD GPUs, the tool may only detect reserved memory rather than the total available unified memory, a common issue that even tools like nvtop struggle with.
Performance Nuances
Several users emphasized that a single speed metric (tokens per second) is insufficient. Factors such as KV cache quantization, batch parallelism, and long context windows can push real-world generation speed well below the initially benchmarked figure.
Summary of the Scoring Logic
To provide a transparent ranking, whichllm uses a weighted scoring system (0-100):
| Factor | Effect | Description |
|---|---|---|
| Benchmark Quality | Core | Weighted merge of multiple leaderboards |
| Model Size | Up to +35 | $\log_2$-scaled proxy for world knowledge |
| Quantization | Penalty | Multiplicative discount for lower-bit quants |
| Evidence Confidence | $\times 0.55–1.0$ | Discount based on source reliability |
| Runtime Fit | $\times 0.50–1.0$ | Penalty for CPU-only or partial offload |
| Speed | $\pm 8$ | Adjustment based on usability thresholds |
| Source Trust | $\pm 5$ | Bonus for official organizations |
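
To make the table concrete, the sketch below combines the factors into a single 0-100 value. Only the ranges (up to +35, ×0.55–1.0, ×0.50–1.0, ±8, ±5) come from the table; the order of operations, the 70B reference size for the size bonus, and all function and parameter names are assumptions for illustration, not whichllm's actual formula.

```python
import math


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def size_bonus(params_billion: float, cap: float = 35.0, ref_params_billion: float = 70.0) -> float:
    """log2-scaled bonus that reaches `cap` at the assumed reference size."""
    scaled = cap * math.log2(1 + params_billion) / math.log2(1 + ref_params_billion)
    return min(cap, scaled)


def combined_score(
    benchmark: float,       # merged benchmark quality, 0-100
    params_billion: float,
    quant_factor: float,    # multiplicative quantization discount, <= 1.0
    confidence: float,      # evidence confidence, 0.55-1.0
    runtime_fit: float,     # 0.50-1.0, lower for CPU-only or partial offload
    speed_adj: float = 0.0,
    trust_adj: float = 0.0,
) -> float:
    confidence = clamp(confidence, 0.55, 1.0)
    runtime_fit = clamp(runtime_fit, 0.50, 1.0)
    speed_adj = clamp(speed_adj, -8.0, 8.0)
    trust_adj = clamp(trust_adj, -5.0, 5.0)

    # Assumed order: additive quality terms first, then multiplicative discounts,
    # then the small additive adjustments.
    raw = (benchmark + size_bonus(params_billion)) * quant_factor * confidence * runtime_fit
    return clamp(raw + speed_adj + trust_adj, 0.0, 100.0)
```

Under these assumptions, a model with strong direct evidence that fits fully on the GPU keeps its full score, while the same model with only self-reported evidence and CPU-only execution could lose well over half of it.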
By combining hardware detection with a dynamic, evidence-based ranking engine, whichllm provides a structured way to navigate the fragmented landscape of local LLMs, moving the community closer to a standardized method of model selection.