Practical Local LLMs: Running Qwen and Gemma on M4 Mac Hardware
The dream of a fully local, private, and internet-independent AI assistant is becoming increasingly attainable. While state-of-the-art (SOTA) frontier models still dominate in raw capability, the emergence of highly efficient open-weight models allows developers to run respectable LLMs on consumer hardware. For those using the M4 MacBook Pro with 24GB of memory, the challenge isn't just about whether a model can run, but whether it can run while leaving enough headroom for the rest of a professional development environment.
Running local models is a balancing act between model size, quantization, and available unified memory. This guide explores the practicalities of this setup, the best models for the job, and the shift in workflow required when moving from cloud-based giants to local assistants.
The Hardware Constraint: The 24GB Threshold
On a Mac with 24GB of unified memory, you are operating in a tight space. You must account for the OS, your IDE, and the inevitable collection of Electron apps (Slack, Discord, VS Code) that consume significant RAM.
Many developers find that models in the 20B+ range—such as GPT-OSS 20B or Devstral Small 24B—technically fit into memory but become unusable in practice, either by triggering heavy swap or leaving no room for the context window. The "sweet spot" for this specific hardware configuration appears to be models in the 4B to 9B range, which provide a balance of intelligence and performance without crashing the system.
Model Recommendation: Qwen 3.5-9B
For those on 24GB hardware, qwen3.5-9b@q4_k_s (4-bit quantization) stands out as a top performer. When run via LM Studio, it can achieve approximately 40 tokens per second, making it feel snappy and responsive.
Key Configuration for Coding and "Thinking"
To get the most out of Qwen 3.5-9B, especially for precise coding tasks, specific inference settings are recommended:
- Temperature: 0.6
- Top P: 0.95
- Top K: 20
- Min P: 0.0
- Presence Penalty: 0.0
- Repetition Penalty: 1.0
Additionally, enabling "thinking" mode (reasoning) often requires a modification to the prompt template. In LM Studio, adding {%- set enable_thinking = true %} to the template can unlock the model's ability to reason through complex problems before providing a final answer.
Workflow Shift: From "Autopilot" to "Co-Pilot"
One of the most critical realizations when moving to local models is that you cannot treat them like Claude 3.5 Sonnet or GPT-4o. Asking a 9B model to build an entire application in one prompt is a recipe for failure.
Instead, a highly interactive, step-by-step workflow is required. This involves:
- Granular Guidance: Breaking tasks into the smallest possible units.
- Active Babysitting: Reviewing every edit and providing immediate corrective feedback.
- Context Management: Providing specific files and clear signatures rather than relying on the model's global understanding of a large repo.
Interestingly, some developers find this "babysitting" beneficial. By forcing the human to remain engaged in the planning and thinking process, it prevents the cognitive atrophy that can occur when offloading too much to a SOTA model.
Real-World Performance: What Works and What Doesn't
Successes: Simple Refactors and Linting
Local models excel at "savant-like" tasks—instant recall of syntax, command-line flags, and simple refactors. For example, taking a list of linter warnings (like Elixir's credo) and applying the suggested idiomatic fixes across multiple files is a task where a 9B model can be highly efficient.
Failures: Complex State and Tool Use
Where local models struggle is in maintaining state across complex operations. A common failure mode is the "hallucinated action": the model might correctly identify a git conflict in a mix.lock file but then fail to actually perform the edit, instead attempting to run git rebase --continue while the conflict markers are still present.
Community Insights and Alternatives
Feedback from the broader community suggests that while 24GB is a great starting point, the experience scales dramatically with more RAM:
- The 32GB-48GB Tier: Users with M4 Pro (48GB) report that while 9B models are fast, they still struggle with complex logic. They suggest that 32-40GB is the minimum for a truly "usable" local coding system.
- The 128GB Tier: For those with M5 Max or high-end M4 Max machines with 128GB of RAM, models like Gemma 4 31B or Qwen 3.6-35B become viable. These models move beyond "science experiments" and begin to approach the utility of frontier models from a year ago.
- Alternative Hardware: Some users suggest that for dedicated inference, a Linux machine with a high-VRAM GPU (like an RTX 4090) or a Jetson Orin 64GB may offer more flexibility and better performance than the unified memory architecture of the Mac, though the Mac remains the most convenient "all-in-one" solution.
Why Go Local?
Despite the performance gap, the incentives for local LLMs are compelling:
- Privacy and Security: Critical for inventors, lawyers, or corporate developers who cannot risk leaking proprietary code or patent-pending ideas to a cloud provider.
- Offline Capability: The ability to work on a plane or in remote areas without an internet connection.
- Cost: Eliminating monthly subscriptions in favor of a one-time hardware investment.
- Tinkering: The joy of optimizing quants, tweaking temperatures, and experimenting with different harnesses like
pi.devorOpenCode.
In summary, local LLMs on M4 hardware are not a replacement for SOTA cloud models, but they are powerful tools for research, rubber-ducking, and simple automation—provided you are willing to steer the ship.