Local AI Models on Apple Silicon
Autonomous coding agents on your Mac. Coding vs reasoning models, real-world RAM after macOS overhead, and tok/s by memory bandwidth.
Coding Models
Qwen3 Coder
Alibaba · Code-optimized MoE
Qwen3-Coder-30B-A3B
Qwen3-Coder-Next 80B/3B
Mistral / Devstral
Mistral AI · Code-first
Codestral 25.01
Devstral Small 2
Devstral 2 (123B, 72.2% SWE) too large for consumer Mac.
Reasoning Models
Qwen 3.5
Alibaba · Feb–Mar 2026 · Sonnet-class
Qwen3.5-27B
Qwen3.5-35B-A3B
Qwen3.5-122B-A10B
Qwen3.5-397B-A17B
DeepSeek R1
DeepSeek · Dense Distillations
R1-Distill-14B
R1-Distill-32B
R1-Distill-70B
Q8 = ~70 GB. R2/V4 not yet released.
Phi-4
Microsoft · Beats DS-R1-70B on AIME
Phi-4-reasoning-plus
Phi-4-reasoning-vision
68.8% LiveCodeBench. Outstanding at 14B.
Llama 4
Meta · MoE · April 2025
Llama 4 Scout
Llama 4 Maverick
*Scout: 17B active, 10M context. Full weights needed.
Gemma 3 27B Dense · ~15 GB
Google · MLX-optimized · QAT models available · 20-50% faster than llama.cpp natively
macOS Memory Overhead
Metal GPU caps VRAM at ~75-80% of unified memory. Reserve 8-12 GB for macOS + apps.
64 GB
Total Unified Memory
~50 GB usable
128 GB
Total Unified Memory
~100 GB usable
256 GB
Total Unified Memory
~220 GB usable
sysctl tweak (iogpu.wired_limit_mb) can raise GPU limit on dedicated machines.
Model ≤60-70% of total RAM for KV cache + context headroom.
RAM Footprint at Q4 — vs Usable Memory
Memory Bandwidth & Tokens/Second
Token generation is memory-bandwidth-bound. tok/s ≈ bandwidth ÷ model_size_bytes. Prefill is compute-bound (M5 neural accelerators: 4x faster TTFT).
| Chip | Bandwidth | 14B Q4 | 32B Q4 | 70B Q4 | 70B Q8 |
|---|---|---|---|---|---|
| M4 Pro | 273 GB/s | ~45 t/s | ~22 t/s | won't fit | won't fit |
| M4 Max | 546 GB/s | ~80 t/s | ~40 t/s | ~28 t/s | ~16 t/s |
| M5 Max | 614 GB/s | ~90 t/s | ~48 t/s | ~32 t/s | ~18 t/s |
| M3 Ultra | 800 GB/s | ~110 t/s | ~55 t/s | ~50 t/s | ~28 t/s |
| M5 Ultra (exp.) | ~1,228 GB/s | ~140 t/s | ~80 t/s | ~65 t/s | ~36 t/s |
Approximate MLX tok/s for token generation. llama.cpp: 20-50% slower. MoE models faster (only active params read).
Best Models by Mac Tier
Mac Mini
M4 Pro
64 GB · 273 GB/s
~50 GB usable
~$1,400
vs Sonnet~70%
- Qwen3-Coder-30B-A3B 69.6% SWE · MoE · 18 GB ~45 t/s (MoE, 3B active)
- Phi-4-reasoning-plus 77.7% AIME · 10 GB ~45 t/s
- Codestral 25.01 86.6% HumanEval · 13 GB ~35 t/s
Best combo: Qwen3-Coder-30B + Phi-4 together (28 GB, room to spare)
Recommended
MacBook Pro
M5 Max (40-core GPU)
128 GB · 614 GB/s
~100 GB usable
4x faster TTFT (neural accel)
~$5,100
vs Sonnet~85%
- DeepSeek R1-Distill-70B (Q8) Near-frontier reasoning · 70 GB · 30 GB headroom ~18 t/s gen · 4x faster prefill vs M4
- Qwen3.5-122B-A10B Sonnet-class MoE · ~70 GB ~60 t/s (MoE, 10B active)
- Qwen3-Coder-Next 80B/3B 70.6% SWE · 48 GB · leaves room for 2nd model ~90+ t/s (MoE, 3B active)
Best: DS R1-70B Q8 (70 GB) + Phi-4 (10 GB) = 80 GB, 20 GB for KV cache. Genuine autonomous coding.
Mac Studio
M5 Ultra (expected mid-2026)
256 GB · ~1,228 GB/s
~220 GB usable
2x bandwidth of M5 Max
~$8,000 – $10,000
vs Sonnet~90-95%
- Qwen3-235B-A22B (Q4) Frontier MoE · 143 GB · 77 GB to spare ~55 t/s (MoE, 22B active)
- Qwen3.5-397B-A17B Latest flagship · ~230 GB · tight fit ~70 t/s (MoE, 17B active)
- DS R1-70B FP16 + Coder-Next Full precision 140 GB + 48 GB = 188 GB ~36 t/s (70B) · both models simultaneously
Best: Qwen3-235B-A22B (143 GB) + Coder-Next (48 GB) = 191 GB. Frontier reasoning + coding, fully local.
Inference Stack — Performance Hierarchy
| Runner | Best For | Speed |
|---|---|---|
| vllm-mlx | Production serving, continuous batching, paged KV cache | Fastest overall · 2x llama.cpp · 6x Ollama (8 concurrent) |
| MLX (mlx-lm) | Single-stream inference, zero-copy unified memory | 20-50% faster than llama.cpp · purpose-built for Metal |
| MLC-LLM | Cross-platform deployment | Slightly behind MLX |
| Ollama | Easiest setup, OpenAI-compatible API | llama.cpp backend · best DX for agent integration |
| llama.cpp | Portable C++ with Metal backend | Good, but Metal is an add-on — not native like MLX |
| llamafile | Single-file distribution (cosmopolitan libc) | Same as llama.cpp · convenience, not speed |
No C++ engine beats MLX on Apple Silicon. MLX was built by Apple specifically for unified memory + Metal. llama.cpp added Metal support retroactively.
Solo Developer Recommendation
M5 Max MacBook Pro
128 GB is the sweet spot.
128 GB is the sweet spot.
DS R1-70B Q8 (70 GB) + Phi-4 (10 GB) = 80 GB, leaving 48 GB for KV cache + context. 614 GB/s bandwidth gives ~18 t/s on 70B. Neural accelerators cut TTFT by 4x vs M4.
~$5,200
TOTAL STARTUP
ROI in ~17 months vs API
Hardware
MacBook Pro M5 Max
128 GB · 614 GB/s · 40-core GPU
Stack
MLX + Ollama
vllm-mlx for serving
Primary Reasoning
DS R1-Distill-70B (Q8)
~70 GB · ~18 t/s
Primary Coding
Qwen3-Coder-Next
~48 GB · ~90+ t/s (MoE)
Fast Triage
Phi-4-reasoning-plus
14B · ~10 GB · ~90 t/s
API Budget
$20–50/mo for hard tasks
Claude / frontier calls
Why not 64 GB Mac Mini?
~50 GB usable. 273 GB/s = slow on 70B. 32B models need babysitting. Good as a secondary inference server.
Why not 256 GB Mac Studio?
M5 Ultra not shipping yet (mid-2026). Double the cost for ~10-15% better output. Buy when revenue justifies it.