Hacker News Weighs In: The State of Coding Models

AI-assisted coding changes quickly, with new models and updates arriving at a rapid pace. Tracking how active developers and technical users respond to them helps in understanding real-world adoption and sentiment. A recent project, "HN SOTA," does this by systematically analyzing Hacker News comments to gauge the popularity of various coding models and the sentiment around them.

This daily-updated pipeline sifts through the 200 most popular Hacker News posts and filters for discussions related to LLMs or coding. It then uses Gemini to identify which models from the OpenRouter list are mentioned in the comments and to assign each mention a sentiment rating (positive, neutral, or negative). The results, aggregated over a 10-day trailing window, offer a snapshot of the community's current leanings and of the practical experience of a highly technical audience.
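The aggregation step can be illustrated with a minimal Python sketch. It is not the project's actual code: the `Mention` record, the function name `aggregate_trailing_window`, and the choice to count the current day as part of the 10-day window are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Mention:
    """One model mention found in a comment (hypothetical record shape)."""
    model: str      # model name as identified in the comment
    sentiment: str  # "positive", "neutral", or "negative"
    day: date       # date the comment was posted


def aggregate_trailing_window(mentions, today, window_days=10):
    """Count sentiment labels per model for mentions inside the trailing window.

    Assumption: the window covers `today` and the (window_days - 1) days
    before it; older and future-dated mentions are ignored.
    """
    cutoff = today - timedelta(days=window_days)
    counts = defaultdict(Counter)
    for mention in mentions:
        if cutoff < mention.day <= today:
            counts[mention.model][mention.sentiment] += 1
    return {model: dict(labels) for model, labels in counts.items()}
```

Calling `aggregate_trailing_window(mentions, date.today())` returns a dictionary keyed by model name, with a count for each sentiment label. Total mentions per model is the sum of those counts.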

Methodology and its Nuances

The HN SOTA project treats community discussion as a proxy for industry sentiment. The methodology itself drew considerable discussion among Hacker News commenters, who pointed out both its strengths and areas for refinement.

One immediate question concerned the initial LLM used for filtering posts, with a commenter asking, "'Prompts an LLM' -> which LLM?" The project states explicitly that Gemini is used for sentiment rating, which led to speculation about whether it was chosen for its perceived neutrality. Other critiques focused on the presentation of the data, particularly the readability of graph labels, and commenters suggested more dynamic visualizations, such as graphing sentiment over time to show trends from a model's release onward.

More fundamentally, the definition of "State of the Art" (SotA) was debated. Several commenters argued that popularity and sentiment, while useful, do not directly equate to technical capability or actual usage. As one user put it, "Just FYI this article seems to define 'start of the art' as 'popular', as measured by 'total mentions and user sentiment', without any bearing on the technical abilities or actual usage of the model." Another suggested downgrading the claim to measuring "visibility rather than performance," and some even suspected "astroturfing" or bot activity influencing metrics. The inherent noise in sentiment classification was also raised as a potential skewing factor.

Suggestions for a more sophisticated methodology included analyzing comments that explicitly compare two models (e.g., 'gpt5.5>opus4.7') and inferring context (e.g., 'ctx:frontend'). There was also a call for aggregating different versions of the same model (e.g., "Claude Opus 4.7" and "Claude Opus Latest") and showing combined metrics for entire model families (e.g., all Claude models vs. OpenAI).
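Two of these suggestions are simple enough to sketch in Python. The code below is illustrative only and is not part of the HN SOTA project: the tag grammar, the prefix-to-family table, and the function names `parse_comparison`, `family_of`, and `mentions_by_family` are all assumptions.

```python
import re
from collections import Counter

# Assumed tag grammar: two model names joined by '>' or '<', e.g. 'gpt5.5>opus4.7'.
COMPARISON = re.compile(r"^\s*([^<>\s]+)\s*([<>])\s*([^<>\s]+)\s*$")

# Hypothetical prefix-to-family table; a real one would need curating.
FAMILY_PREFIXES = {
    "claude": "Claude",
    "gpt": "GPT",
    "gemini": "Gemini",
}


def parse_comparison(tag):
    """Return (preferred, other) for a comparison tag, or None if it is not one."""
    match = COMPARISON.match(tag)
    if match is None:
        return None
    left, op, right = match.groups()
    return (left, right) if op == ">" else (right, left)


def family_of(model_name):
    """Map a model name to its family by prefix; unknown names map to themselves."""
    lowered = model_name.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if lowered.startswith(prefix):
            return family
    return model_name


def mentions_by_family(mention_counts):
    """Sum per-model mention counts into per-family totals."""
    totals = Counter()
    for model, count in mention_counts.items():
        totals[family_of(model)] += count
    return dict(totals)
```

With this mapping, "Claude Opus 4.7" and "Claude Opus Latest" both count toward a single "Claude" total. A tag such as 'ctx:frontend' is not a comparison, so `parse_comparison` returns `None` for it and a separate step would have to handle it.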

Proprietary Models: A Mixed Bag

The analysis of Hacker News comments reveals a complex sentiment towards leading proprietary models, often characterized by high usage coupled with significant frustrations.

Claude's Challenges

Claude, particularly its Opus versions, garners a high number of mentions, indicating widespread use. However, this popularity is frequently accompanied by negative sentiment. As one user noted, "while Claude is currently taking the #1 spot in mentions, it carries a lot of negative sentiment due to API pricing policies and frequent server downtime." Regressions in model output, particularly with Claude Code, were also highlighted, with one user expressing frustration after experiencing "true, reproducible, incredibly frustrating regressions in model output" post-April 26th updates. The "attitude" of Claude Opus 4.7 was another point of contention, with one commenter describing it as "difficult to enforce checking stuff before answering, and it suppose he knows better than me and reality."

Despite these issues, Claude is still valued for certain tasks. It's often seen as superior for writing text during code reviews and for long-running agents, provided token limits aren't a concern. However, operational failures, such as "shell confusion" or "permissions issues" on Windows, were also reported.

GPT's Strengths and Weaknesses

GPT models, including GPT-5.5 and Codex, also feature prominently. GPT-5.5, in particular, appears to have "more positive feedback" than Claude, with users describing it as "amazing, a true jump" that approaches the task inference capabilities of Anthropic models. It's often preferred for sheer code-writing capability. However, it's not without its drawbacks, including a "reduced context window and degradation in compaction." Furthermore, issues with text corruption when generating in non-English languages like Korean or Chinese were noted.

Codex, while sometimes freezing on Windows, is appreciated for its efficiency, burning "very less tokens as compared to Claude Code for the same task." Many users find value in combining models, using Claude for initial generation and then verifying or finishing tasks with Codex.

Gemini's Struggle

Gemini, despite being used by the project's author for sentiment analysis, generally receives a lukewarm to negative reception from commenters. One blunt assessment stated, "Gemini is pretty much unusable," while another commenter questioned, "So no one's using Gemini on HN?" This suggests a significant gap in perceived utility or performance compared to its competitors in the coding model space.

The Rise of Open-Source Alternatives

A significant trend highlighted by the Hacker News community is the growing appreciation and positive sentiment towards open-source models, driven by concerns over vendor lock-in, cost, and the inconsistent performance of proprietary solutions.

Models like Qwen and DeepSeek are frequently mentioned in the context of "guarding against vendor lock-in," which contributes to their positive sentiment. Commenters expressed hope that "the trend of people appreciating open models continue." DeepSeek Flash, in particular, was lauded for its efficiency and cost-effectiveness, offering good performance at a fraction of the price of some proprietary models.

Kimi K2.6 emerged as a standout, described as "great for the API price, efficient, fast and does very well on all my metrics." One commenter was particularly impressed, stating, "Amazing how Kimi has no negative feedback." This model is seen as a balanced product across various performance areas, demonstrating that open-weight companies can deliver competitive solutions.

Other open-source models, such as Gemma 4 (specifically the 26B-A4B model), are gaining traction for local inference, offering "surprisingly good results (both in inference speed and code quality)." However, some top-performing open-weight models, like MiMo V2.5 Pro, were noted for not receiving as much attention despite their capabilities, suggesting a disparity between performance and public hype.

Many commenters believe that open-source models, combined with open-source harness setups, represent the "real salvation" from the "wildly unpredictable but systematic performance of large models like Opus and ChatGPT." The sentiment is that as proprietary models face
