Optimizing Local LLM Selection: An In-Depth Look at whichllm

Selecting a local Large Language Model (LLM) often feels like a guessing game. Many users default to the "biggest model that fits" heuristic, assuming that the largest model that fits in VRAM is automatically the best choice. However, newer, smaller models frequently outperform older, larger ones, and the quantization level significantly affects both quality and speed.

whichllm is a command-line tool designed to replace that guesswork. Rather than relying on simple size heuristics, it auto-detects system hardware and ranks models based on real-world benchmarks, recency, and hardware compatibility. This turns local LLM deployment from trial and error into an evidence-based decision.

Beyond the "What Fits?" Mentality

The core philosophy of whichllm is that fitting a model into VRAM is the easy part; the hard part is determining which of the fitting models is actually the most capable. To that end, the tool combines three mechanisms:

Evidence-Based Ranking

Instead of a static list, whichllm aggregates data from multiple high-signal sources, including LiveBench, Artificial Analysis, Aider, Chatbot Arena Elo, and the Open LLM Leaderboard. To limit the effect of "benchmark gaming" and outdated data, the tool applies recency-aware weighting that demotes stale leaderboard results along a model's lineage.
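One way to picture a recency-aware merge is a weighted mean in which each leaderboard's weight decays with the age of its data. The sketch below is illustrative only: the `LeaderboardScore` type, the per-source `base_weight`, the exponential decay, and the 180-day half-life are all assumptions made for the example, not whichllm's actual formula, and it does not model lineage at all.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class LeaderboardScore:
    source: str         # e.g. "LiveBench"
    score: float        # assumed already normalized to 0-100
    base_weight: float  # hypothetical per-source trust weight
    updated: date       # when this leaderboard entry was last refreshed


def recency_factor(updated: date, today: date, half_life_days: float = 180.0) -> float:
    """Halve a score's weight for every `half_life_days` of age (assumed decay)."""
    age_days = max((today - updated).days, 0)
    return 0.5 ** (age_days / half_life_days)


def merge_scores(scores: list[LeaderboardScore], today: date) -> Optional[float]:
    """Weighted mean of the scores, with stale entries contributing less."""
    weighted_sum = 0.0
    total_weight = 0.0
    for s in scores:
        weight = s.base_weight * recency_factor(s.updated, today)
        weighted_sum += weight * s.score
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else None


today = date(2025, 1, 1)
scores = [
    LeaderboardScore("LiveBench", 80.0, 1.0, today),
    LeaderboardScore("Open LLM Leaderboard", 40.0, 1.0, today - timedelta(days=180)),
]
# The fresh entry carries twice the weight of the 180-day-old one.
merged = merge_scores(scores, today)
```

With a decay like this, an old high score cannot dominate a recent lower one indefinitely; how aggressive the demotion should be is a tuning choice.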

Confidence-Graded Scoring

Not all benchmark data is created equal. whichllm tags scores based on their origin:

Direct: Exact model ID match (highest confidence).
Variant: Suffix-stripped or instruct variants.
Base: Inherited from the base model.
Interpolated: Size-aware interpolation within a model family.
Self-reported: Uploader-claimed evals (heavily discounted).
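A minimal sketch of how such tags could translate into a discount follows. The scoring summary later in this article gives a confidence multiplier range of 0.55 to 1.0; pinning "Direct" to 1.0 and "Self-reported" to 0.55, and the intermediate values, are assumptions chosen for illustration, as are the names `CONFIDENCE_MULTIPLIER` and `apply_confidence`.

```python
# Hypothetical multipliers: only the overall 0.55-1.0 range comes from the
# article; the per-tier values below are illustrative assumptions.
CONFIDENCE_MULTIPLIER = {
    "direct": 1.00,         # exact model ID match
    "variant": 0.90,        # suffix-stripped or instruct variant
    "base": 0.80,           # inherited from the base model
    "interpolated": 0.70,   # size-aware interpolation within a family
    "self_reported": 0.55,  # uploader-claimed evals
}


def apply_confidence(score: float, origin: str) -> float:
    """Discount a benchmark score according to where the evidence came from."""
    try:
        return score * CONFIDENCE_MULTIPLIER[origin]
    except KeyError:
        raise ValueError(f"unknown evidence origin: {origin!r}") from None


direct = apply_confidence(80.0, "direct")
claimed = apply_confidence(80.0, "self_reported")
```

The same raw score therefore ranks lower when it is only claimed by the uploader than when it was measured on the exact model.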

Architecture-Aware VRAM Estimation

Memory estimation goes beyond simple weight calculation. The tool calculates VRAM requirements by accounting for:

Model Weights: Based on parameter count and quantization level.
KV Cache and Activations: KV cache sizing that accounts for GQA (Grouped-Query Attention), plus activation overhead.
Framework Overhead: A baseline buffer (approx. 500MB) for the inference engine.
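The components above can be combined in a back-of-the-envelope calculation. The sketch below uses the standard KV cache size formula (keys and values, per layer, per KV head) and the roughly 500MB overhead mentioned above; the function name, its parameters, and the example model shape are assumptions for illustration, and activation memory is left as an explicit parameter rather than modeled.

```python
def estimate_vram_gb(
    params_billion: float,    # parameter count, in billions
    bits_per_weight: float,   # e.g. 16 for FP16, about 4.5 for a 4-bit quant
    n_layers: int,
    n_kv_heads: int,          # with GQA this is smaller than the attention head count
    head_dim: int,
    context_len: int,
    kv_bytes_per_elem: int = 2,   # FP16 KV cache
    activation_gb: float = 0.0,   # not modeled here; pass an estimate if known
    overhead_gb: float = 0.5,     # baseline framework buffer (approx. 500MB)
) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    # Factor 2: one tensor for keys and one for values, per layer.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_elem
    return (weight_bytes + kv_cache_bytes) / 1e9 + activation_gb + overhead_gb


# Illustrative 8B-class shape: 32 layers, head_dim 128, 8K context.
with_gqa = estimate_vram_gb(8, 4.5, n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
without_gqa = estimate_vram_gb(8, 4.5, n_layers=32, n_kv_heads=32, head_dim=128, context_len=8192)
```

Because the KV cache scales with the number of KV heads and the context length, two models with the same parameter count can need noticeably different amounts of memory, which is why a weights-only estimate can mislead.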

Key Features and Workflow

whichllm is designed to be scriptable and integrated into existing workflows. Its primary capabilities include:

Hardware Simulation: Users can plan future purchases by simulating specific GPUs (e.g., whichllm --gpu "RTX 4090").
Reverse Lookup: The plan command allows users to determine what hardware is required for a specific model (e.g., whichllm plan "llama 3 70b").
Instant Execution: The run command creates an isolated environment via uv, downloads the best GGUF variant, and starts a chat session immediately.
Developer Snippets: The snippet command generates ready-to-run Python code using llama-cpp-python or transformers for easy integration into applications.

Community Perspectives and Technical Critiques

While the tool has been well received for its utility, the Hacker News community raised several critical technical points that highlight the complexities of local LLM orchestration.

The "Best" Dilemma

One of the most prominent critiques is the subjectivity of "best." As user @bityard noted, a model's suitability depends entirely on the workload:

The "best" model is not "whatever fits into VRAM." You can do lots of useful stuff with a small CPU-only model... The only way to know whether a model is suitable for your specific task is to try it out for yourself.

Hardware Edge Cases

Users pointed out gaps in hardware detection, particularly regarding unified memory architectures. For instance, @cyanydeez noted that on some Linux setups with AMD GPUs, the tool may only detect reserved memory rather than the total available unified memory, a common issue that even tools like nvtop struggle with.

Performance Nuances

Several users emphasized that a single speed metric (tokens per second) is insufficient. Factors such as KV cache quantization, batch parallelism, and long context windows can reduce generation speed substantially, whatever the initial benchmark reported.
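A rough way to see the long-context effect is the common memory-bandwidth approximation for single-stream decoding: each generated token requires reading the weights and the current KV cache once, so speed is bounded by bandwidth divided by bytes read. The sketch below is a simplified model under that assumption, not a measurement and not how whichllm computes speed; the function names and the example numbers (a 4.5 GB model, 1 TB/s of bandwidth) are hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float) -> float:
    """Size of the KV cache at a given context length (keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def est_tokens_per_sec(weight_bytes: float, kv_bytes: float,
                       bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every token reads weights + KV cache once."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


WEIGHTS = 4.5e9     # hypothetical 4-bit 8B-class model
BANDWIDTH = 1.0e12  # hypothetical 1 TB/s of memory bandwidth
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

short_ctx = est_tokens_per_sec(
    WEIGHTS, kv_cache_bytes(**SHAPE, context_len=512, bytes_per_elem=2), BANDWIDTH)
long_ctx = est_tokens_per_sec(
    WEIGHTS, kv_cache_bytes(**SHAPE, context_len=131072, bytes_per_elem=2), BANDWIDTH)
# Same long context, but with the KV cache quantized to 1 byte per element.
long_ctx_q8 = est_tokens_per_sec(
    WEIGHTS, kv_cache_bytes(**SHAPE, context_len=131072, bytes_per_elem=1), BANDWIDTH)
```

Under this model the long-context estimate falls well below the short-context one, and KV cache quantization recovers part of the gap, which is consistent with the commenters' point that one tokens-per-second figure hides a lot.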

Summary of the Scoring Logic

To keep the ranking transparent, whichllm produces a score from 0 to 100 by combining a benchmark-based core with additive bonuses and multiplicative discounts:

Factor	Effect	Description
Benchmark Quality	Core	Weighted merge of multiple leaderboards
Model Size	Up to +35	$\log_2$-scaled proxy for world knowledge
Quantization	Penalty	Multiplicative discount for lower-bit quants
Evidence Confidence	$\times 0.55$–$1.0$	Discount based on source reliability
Runtime Fit	$\times 0.50$–$1.0$	Penalty for CPU-only or partial offload
Speed	$\pm 8$	Adjustment based on usability thresholds
Source Trust	$\pm 5$	Bonus for official organizations

By combining hardware detection with a dynamic, evidence-based ranking engine, whichllm provides a structured way to navigate the fragmented landscape of local LLMs, moving the community closer to a standardized method of model selection.
