The AI Enthusiast — Local AI Models on Apple Silicon · Issue No. 06

Coding Models

Qwen3 Coder

Alibaba · Code-optimized MoE

Qwen3-Coder-30B-A3B

MoE ~18 GB

Qwen3-Coder-Next 80B/3B

70.6% SWE ~48 GB

Mistral / Devstral

Mistral AI · Code-first

Codestral 25.01

86.6% HE ~13 GB

Devstral Small 2

24B New ~14 GB

Devstral 2 (123B, 72.2% SWE) too large for consumer Mac.

Reasoning Models

Qwen 3.5

Alibaba · Feb–Mar 2026 · Sonnet-class

Qwen3.5-27B

Dense ~16 GB

Qwen3.5-35B-A3B

MoE ~20 GB

Qwen3.5-122B-A10B

MoE ~70 GB

Qwen3.5-397B-A17B

Frontier ~230 GB

DeepSeek R1

DeepSeek · Dense Distillations

R1-Distill-14B

14B ~9 GB

R1-Distill-32B

32B ~20 GB

R1-Distill-70B

Best Value ~40 GB

Q8 = ~70 GB. R2/V4 not yet released.

Phi-4

Microsoft · Beats DS-R1-70B on AIME

Phi-4-reasoning-plus

77.7% AIME ~10 GB

Phi-4-reasoning-vision

15B multi ~10 GB

68.8% LiveCodeBench. Outstanding at 14B.

Llama 4

Meta · MoE · April 2025

Llama 4 Scout

109B/17B ~12 GB*

Llama 4 Maverick

400B/17B ~230 GB

*Scout: 17B active, 10M context. Full weights needed.

Gemma 3 27B Dense · ~15 GB

Google · MLX-optimized · QAT models available · 20-50% faster than llama.cpp natively

macOS Memory Overhead

Metal GPU caps VRAM at ~75-80% of unified memory. Reserve 8-12 GB for macOS + apps.

64 GB

Total Unified Memory

~50 GB usable

128 GB

Total Unified Memory

~100 GB usable

256 GB

Total Unified Memory

~220 GB usable

sysctl tweak (iogpu.wired_limit_mb) can raise GPU limit on dedicated machines. Model ≤60-70% of total RAM for KV cache + context headroom.

RAM Footprint at Q4 — vs Usable Memory

Phi-4 Reasoning+

10

Llama 4 Scout

12

Codestral 25.01

13

Devstral Small 2

14

Qwen3-Coder-30B

18

DS R1-Distill-32B

20

DS R1-Distill-70B

40

Qwen3-Coder-Next

48

DS R1-70B (Q8)

70

Qwen3-235B-A22B

143

Qwen3.5-397B

230

Memory Bandwidth & Tokens/Second

Token generation is memory-bandwidth-bound. tok/s ≈ bandwidth ÷ model_size_bytes. Prefill is compute-bound (M5 neural accelerators: 4x faster TTFT).

Chip	Bandwidth	14B Q4	32B Q4	70B Q4	70B Q8
M4 Pro	273 GB/s	~45 t/s	~22 t/s	won't fit	won't fit
M4 Max	546 GB/s	~80 t/s	~40 t/s	~28 t/s	~16 t/s
M5 Max	614 GB/s	~90 t/s	~48 t/s	~32 t/s	~18 t/s
M3 Ultra	800 GB/s	~110 t/s	~55 t/s	~50 t/s	~28 t/s
M5 Ultra (exp.)	~1,228 GB/s	~140 t/s	~80 t/s	~65 t/s	~36 t/s

Approximate MLX tok/s for token generation. llama.cpp: 20-50% slower. MoE models faster (only active params read).

Best Models by Mac Tier

Mac Mini

M4 Pro

64 GB · 273 GB/s

~50 GB usable

~$1,400

vs Sonnet~70%

Qwen3-Coder-30B-A3B 69.6% SWE · MoE · 18 GB ~45 t/s (MoE, 3B active)
Phi-4-reasoning-plus 77.7% AIME · 10 GB ~45 t/s
Codestral 25.01 86.6% HumanEval · 13 GB ~35 t/s

Best combo: Qwen3-Coder-30B + Phi-4 together (28 GB, room to spare)

Recommended

MacBook Pro

M5 Max (40-core GPU)

128 GB · 614 GB/s

~100 GB usable

4x faster TTFT (neural accel)

~$5,100

vs Sonnet~85%

DeepSeek R1-Distill-70B (Q8) Near-frontier reasoning · 70 GB · 30 GB headroom ~18 t/s gen · 4x faster prefill vs M4
Qwen3.5-122B-A10B Sonnet-class MoE · ~70 GB ~60 t/s (MoE, 10B active)
Qwen3-Coder-Next 80B/3B 70.6% SWE · 48 GB · leaves room for 2nd model ~90+ t/s (MoE, 3B active)

Best: DS R1-70B Q8 (70 GB) + Phi-4 (10 GB) = 80 GB, 20 GB for KV cache. Genuine autonomous coding.

Mac Studio

M5 Ultra (expected mid-2026)

256 GB · ~1,228 GB/s

~220 GB usable

2x bandwidth of M5 Max

~$8,000 – $10,000

vs Sonnet~90-95%

Qwen3-235B-A22B (Q4) Frontier MoE · 143 GB · 77 GB to spare ~55 t/s (MoE, 22B active)
Qwen3.5-397B-A17B Latest flagship · ~230 GB · tight fit ~70 t/s (MoE, 17B active)
DS R1-70B FP16 + Coder-Next Full precision 140 GB + 48 GB = 188 GB ~36 t/s (70B) · both models simultaneously

Best: Qwen3-235B-A22B (143 GB) + Coder-Next (48 GB) = 191 GB. Frontier reasoning + coding, fully local.

Inference Stack — Performance Hierarchy

Runner	Best For	Speed
vllm-mlx	Production serving, continuous batching, paged KV cache	Fastest overall · 2x llama.cpp · 6x Ollama (8 concurrent)
MLX (mlx-lm)	Single-stream inference, zero-copy unified memory	20-50% faster than llama.cpp · purpose-built for Metal
MLC-LLM	Cross-platform deployment	Slightly behind MLX
Ollama	Easiest setup, OpenAI-compatible API	llama.cpp backend · best DX for agent integration
llama.cpp	Portable C++ with Metal backend	Good, but Metal is an add-on — not native like MLX
llamafile	Single-file distribution (cosmopolitan libc)	Same as llama.cpp · convenience, not speed

No C++ engine beats MLX on Apple Silicon. MLX was built by Apple specifically for unified memory + Metal. llama.cpp added Metal support retroactively.

Solo Developer Recommendation

M5 Max MacBook Pro
128 GB is the sweet spot.

DS R1-70B Q8 (70 GB) + Phi-4 (10 GB) = 80 GB, leaving 48 GB for KV cache + context. 614 GB/s bandwidth gives ~18 t/s on 70B. Neural accelerators cut TTFT by 4x vs M4.

~$5,200

TOTAL STARTUP

ROI in ~17 months vs API

Hardware

MacBook Pro M5 Max

128 GB · 614 GB/s · 40-core GPU

Stack

MLX + Ollama

vllm-mlx for serving

Primary Reasoning

DS R1-Distill-70B (Q8)

~70 GB · ~18 t/s

Primary Coding

Qwen3-Coder-Next

~48 GB · ~90+ t/s (MoE)

Fast Triage

Phi-4-reasoning-plus

14B · ~10 GB · ~90 t/s

API Budget

$20–50/mo for hard tasks

Claude / frontier calls

Why not 64 GB Mac Mini?

~50 GB usable. 273 GB/s = slow on 70B. 32B models need babysitting. Good as a secondary inference server.

Why not 256 GB Mac Studio?

M5 Ultra not shipping yet (mid-2026). Double the cost for ~10-15% better output. Buy when revenue justifies it.