Running Frontier LLMs on a 96GB Mac Studio:
REAP, IMatrix, and the Models That Beat DeepSeek V4 Flash

May 25, 2026 · ~15 min read · local-llm mac-studio gguf quantization

The 96GB Question

The M3 Ultra Mac Studio with 96GB of unified memory is a beast. 400 GB/s of bandwidth, a massive shared pool that CPU and GPU both access at full speed. But when you sit down to run a serious LLM locally, you quickly hit a wall: what models actually fit, and which ones are better than what you're paying for on the API?

If you're coming from DeepSeek V4 Flash — 284B total parameters, 13B active, served at FP8 quality — you know the bar is high. Reaching or exceeding that quality from a single Mac Studio requires careful model selection. This guide covers everything I found: REAP-pruned MoE models from Cerebras Research, IMatrix GGUF quantizations from the community, and the concrete math of what fits.

TL;DR

Llama 4 Scout Q4_K_M (67.55 GB, 17B active params) is the single best model for 96GB. 17B active beats DS4 Flash's 13B, it has 10M native context, and it leaves 28GB for KV cache.

Qwen3.5-122B-A10B Q4_K_M (77.62 GB, 10B active) is the runner-up — slightly fewer active params than DS4 Flash, but Qwen3.5's architecture is excellent and 122B total params means deep knowledge.

GLM-4.5-Air-REAP-82B-A12B Q4 (~41 GB, 12B active) is the REAP experiment worth trying — pruned from GLM-4.5, near-lossless, lots of room for context.

Links to download each model are at the bottom.

Understanding the Constraints

Before we talk about models, you need to understand what 96GB actually buys you. Unified memory on Apple Silicon is both system RAM and GPU VRAM — no PCIe transfers, no separate pool. That's its superpower. But the math is unforgiving:

Quantization	Bytes per param	30B model	100B model	284B model (DS4 Flash)
FP8 / Q8_0	1 byte	30 GB	100 GB	284 GB ❌
Q6_K	~0.75 bytes	22.5 GB	75 GB	213 GB ❌
Q5_K_M	~0.63 bytes	18.9 GB	63 GB	179 GB ❌
Q4_K_M	~0.5 bytes	15 GB ✅	50 GB ✅	142 GB ❌
Q3_K_M	~0.38 bytes	11.4 GB	38 GB	108 GB ❌
Q2_K / IQ2_M	~0.25 bytes	7.5 GB	25 GB	71 GB ✅

The key insight: A 284B MoE at Q4 needs ~142 GB — 48% more than 96GB provides. Even at Q3 (~108 GB), it's over the limit. Only at Q2 (~71 GB) does DS4 Flash fit, but Q2 on a model this big is... not great. The quality cliff for Q2 on MoE architectures is real — you lose the nuance that makes the model worth running.

So the strategy isn't "cram DS4 Flash into 96GB at Q2." The strategy is find models with better active-param-per-GB ratios that fit at Q4 or Q5, where quality is preserved.

What "Better Than DS4 Flash" Actually Means

DeepSeek V4 Flash has 284B total parameters but only 13B active per token. MoE models work by routing each token through a subset of experts. The active-param count determines per-token compute capacity — how much thinking happens per word. The total param count determines parametric knowledge — how much is memorized in weights.

For most tasks, active params matter more than total params. A model with 17B active and 109B total (Llama 4 Scout) can be more capable per token than one with 13B active and 284B total (DS4 Flash), especially when both are at Q4. The extra total params in DS4 Flash represent knowledge breadth, but the active-param ceiling limits per-token reasoning.

So the bar is: ≥13B active params, ≥80B total params, Q4 or better quantization, fits in 96GB with room for context.

REAP: Router-weighted Expert Activation Pruning

Cerebras Research published REAP (accepted to ICLR 2026) — a method for compressing MoE models by pruning the least-used experts. Rather than merging experts (which causes "functional subspace collapse"), REAP identifies and removes experts that contribute least to the output, then fine-tunes the remaining experts and router to recover quality.

The results are impressive: near-lossless compression at 50% expert removal on Qwen3-Coder-480B and Kimi-K2. For our purposes, REAP creates models with fewer total parameters but the same active-param count — exactly what we need for fitting into 96GB.

Cerebras released a full collection on HuggingFace: cerebras/cerebras-reap.

REAP Models for 96GB

Model	Total	Active	Q4 Size	Fits 96GB?	Notes
GLM-4.5-Air-REAP-82B-A12B	82B	12B	~41 GB	✅ Lots of room	Pruned from GLM-4.5-Air. Best REAP fit.
GLM-4.5-Air-REAP-82B-A12B-FP8	82B	12B	~82 GB	✅ Tight	FP8 version, near-lossless. Less room for ctx.
GLM-4.6-REAP-218B-A32B-FP8	218B	32B	~109 GB (Q4)	❌ Q2 only	32B active is amazing, but doesn't fit at good quants.
GLM-4.6-REAP-252B-A32B-FP8	252B	32B	~126 GB (Q4)	❌	Even larger variant.
Qwen3-Coder-REAP-25B-A3B	25B	3B	~13 GB	✅ Tiny	Code-specific. Too small to compete with DS4 Flash.
Qwen3-Coder-REAP-246B-A35B	246B	35B	~123 GB (Q4)	❌	Code-specific. Doesn't fit at usable quants.
Qwen3-Coder-REAP-363B-A35B	363B	35B	~182 GB (Q4)	❌	Even larger.

GLM-4.5-Air-REAP-82B-A12B is the only REAP model that clears the bar: 12B active (near DS4 Flash's 13B), fits at Q4 with 55GB leftover for context cache. The FP8 variant is tighter but gives you lossless weight precision if you're willing to sacrifice context length.

The bigger REAP models (GLM-4.6-REAP with 32B active) are tantalizing — 32B active would handily beat DS4 Flash — but they only fit at Q2 or IQ2, which defeats the purpose.

IMatrix GGUF Quantizations: The Community Standard

While REAP is a pruning technique (reducing the model itself), IMatrix is a quantization calibration method that improves low-bit quality. Created by the llama.cpp community, an imatrix calibration file collects activation statistics from a representative dataset and uses them to allocate bits more intelligently across layers during quantization.

The two main publishers of imatrix GGUFs are Bartowski and Unsloth. Unsloth's Dynamic GGUF quants (the "UD-" prefix) use an improved imatrix calibration dataset and consistently achieve better KL-divergence at the same file size compared to standard K-quants.

Available IMatrix Models That Fit 96GB

Model	Total	Active	Best Quant	File Size	Context	Source
Llama 4 Scout	109B	17B	Q4_K_M	67.55 GB	10M tokens	Bartowski
Qwen3.5-122B-A10B	122B	10B	Q4_K_M	77.62 GB	256K tokens	Bartowski / Unsloth
GLM-4.5-Air-REAP-82B-A12B	82B	12B	Q4_K_M	~41 GB	256K tokens	Cerebras (official)
Qwen3.5-35B-A3B	35B	3B	Q4_K_M	~22 GB	256K tokens	Bartowski / Unsloth
DeepSeek V4 Flash	284B	13B	Q3	~60 GB	—	Community (experimental)
DeepSeek V4 Flash	284B	13B	Q4	~80 GB	—	Community (experimental)

Deep Dive: Llama 4 Scout

Meta's Llama 4 Scout is 109B total with 17B active params — 16 experts, with 2 active per token. At Q4_K_M (67.55 GB), it fits comfortably in 96GB with 28GB left for KV cache. At 17B active, it beats DS4 Flash's 13B by ~30%. The 10M token context window is architectural, not a gimmick — it uses interleaved grouped-query attention with RoPE frequency scaling.

On benchmarks, Llama 4 Scout competes with GPT-4o class models. The 17B active params give it serious per-token reasoning capacity. For local inference on a Mac Studio, expect ~15-25 tok/s depending on quantization and context length — plenty fast for interactive use.

Download: huggingface-cli download bartowski/meta-llama_Llama-4-Scout-17B-16E-Instruct-GGUF --include "Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf" --local-dir ./models

Deep Dive: Qwen3.5-122B-A10B

Qwen3.5-122B-A10B is the sweet spot. 122B total, 10B active, and Qwen's architecture is exceptionally well-optimized. At Q4_K_M (77.62 GB), it fits with 18GB for context — enough for 128K+ token windows. 10B active is slightly less than DS4 Flash's 13B, but Qwen3.5's training quality is top-tier — many benchmark comparisons put it at or above Claude Sonnet 4.5 level for coding and reasoning.

The 122B total means it has a massive knowledge base. For retrieval-heavy tasks, factual questions, or long-context reasoning, the extra total params matter. The tradeoff vs Llama 4 Scout is: more total knowledge (122B vs 109B) but fewer active params per token (10B vs 17B).

Download: huggingface-cli download bartowski/Qwen_Qwen3.5-122B-A10B-GGUF --include "Qwen3.5-122B-A10B-Q4_K_M.gguf" --local-dir ./models

Deep Dive: GLM-4.5-Air-REAP-82B-A12B

This is the REAP experiment worth running. GLM-4.5-Air is ZhiPu's strong 20B-A3B MoE, and the REAP-pruned version compresses it to 82B total while keeping 12B active. At Q4_K_M (~41 GB), it leaves a massive 55GB for context — you can run 1M+ token windows without breaking a sweat.

The REAP paper showed that GLM-4.5-Air-REAP-82B achieves near-lossless performance compared to the unpruned GLM-4.5-Air-148B. So you're getting essentially the same quality in half the memory. 12B active is close enough to DS4 Flash's 13B that quality differences will be marginal.

Download: huggingface-cli download cerebras/GLM-4.5-Air-REAP-82B-A12B --local-dir ./models

Note: the REAP models from Cerebras are not in GGUF format — they're standard HF checkpoints. You'll need to convert to GGUF using llama.cpp's convert script, or run them via llama.cpp with the HF loader directly.

What About DeepSeek V4 Flash Locally?

Can you run DS4 Flash on 96GB? Technically yes, at Q2 quantization (~40-71 GB depending on method). But I wouldn't recommend it. Q2 on a 284B MoE is a significant quality regression. The model was trained for FP8 inference — the quantized version loses the nuance that makes it competitive. You'll get worse results than running a natively smaller model at Q4.

There are experimental GGUF builds for DS4 Flash in the community (llama.cpp issue #22319), but support is early. Inference speed on Mac Studio's 400 GB/s bandwidth will be constrained by the massive weight size even at low quants — expect 5-10 tok/s.

Verdict: Wait for Q3 or Q4 community quants to mature, or accept Q2 quality. For now, Llama 4 Scout or Qwen3.5-122B are better options.

Performance Expectations on Mac Studio (M3 Ultra)

The M3 Ultra has 400 GB/s memory bandwidth. For MoE models, throughput is primarily bandwidth-bound, not compute-bound. Here's what you can expect:

Model	Quant	Weight Size	Estimated Tok/s	Max Context (with headroom)
Llama 4 Scout	Q4_K_M	67.55 GB	~18-25 tok/s	~256K tokens
Qwen3.5-122B-A10B	Q4_K_M	77.62 GB	~15-20 tok/s	~128K tokens
GLM-4.5-Air-REAP-82B-A12B	Q4_K_M	~41 GB	~25-35 tok/s	~512K tokens
Qwen3.5-35B-A3B	Q4_K_M	~22 GB	~40-55 tok/s	~1M tokens
DeepSeek V4 Flash	Q2	~71 GB	~5-10 tok/s	~64K tokens

These are estimates based on bandwidth calculations. Real-world performance depends on prompt processing (which is compute-heavy), context length, and batching. For single-user interactive use, even 15 tok/s is plenty fast.

How to Actually Run These

All of these models work with llama.cpp and its OpenAI-compatible server. On macOS, the recommended setup is:

# Install llama.cpp via brew
brew install llama.cpp

# Run a server with flash attention and 32K context
llama-server \
  -hf bartowski/Qwen_Qwen3.5-122B-A10B-GGUF:Q4_K_M \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn on \
  --ctx-size 32768 \
  --port 8080

For Ollama users, most of these models have Ollama tags available:

# Qwen3.5-122B-A10B
ollama pull qwen3.5:122b-a10b-q4_K_M

# Llama 4 Scout
ollama pull llama4-scout

# Qwen3.5-35B-A3B
ollama pull qwen3.5:35b-a3b-q4_K_M

For REAP models in HF format, you can convert to GGUF or use llama.cpp's native HF loader:

# Convert REAP model to GGUF
python3 convert-hf-to-gguf.py \
  --ckpt-dir ./GLM-4.5-Air-REAP-82B-A12B \
  --out-file ./GLM-4.5-Air-REAP-82B-A12B-Q4.gguf \
  --quantize Q4_K_M

Putting It All Together

If you have a 96GB Mac Studio arriving Monday (congrats), here's my recommended priority list:

Start with Llama 4 Scout Q4_K_M (67.55 GB, 17B active). This is the single best model for the hardware. Download from Bartowski's HF page.
Also grab Qwen3.5-122B-A10B Q4_K_M (77.62 GB, 10B active). Keep it as a second option for tasks where Qwen's training quality matters more than active-param count. Swap between them based on the task.
Experiment with GLM-4.5-Air-REAP-82B-A12B if you want to see what REAP pruning can do. The 12B active and 55GB of spare memory make it ideal for long-context experiments.
Skip DeepSeek V4 Flash locally for now. Wait for proper Q3/Q4 community quants or accept API usage for that model.

Quick Reference: Model Downloads

Model	HF Path	File to Download
Llama 4 Scout Q4_K_M	`bartowski/meta-llama_Llama-4-Scout-17B-16E-Instruct-GGUF`	`Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf`
Qwen3.5-122B-A10B Q4_K_M	`bartowski/Qwen_Qwen3.5-122B-A10B-GGUF`	`Qwen3.5-122B-A10B-Q4_K_M.gguf`
Qwen3.5-35B-A3B Q4_K_M	`unsloth/Qwen3.5-35B-A3B-GGUF`	`Qwen3.5-35B-A3B-Q4_K_M.gguf`
GLM-4.5-Air-REAP-82B-A12B	`cerebras/GLM-4.5-Air-REAP-82B-A12B`	HF checkpoint (convert to GGUF)

Beyond Models: Optimizing Inference

A few tips to maximize your Mac Studio's potential:

Use flash attention. llama.cpp's --flash-attn on dramatically reduces memory usage for long contexts. Always enable it.
Match KV cache quantization to model quantization. Use --cache-type-k q4_0 --cache-type-v q4_0 for Q4 models. This keeps cache memory proportional to model memory.
Metal performance. macOS Sonoma+ has excellent Metal performance for llama.cpp. Make sure you're using a recent build with Metal support (llama.cpp --help should show Metal as available).
Batch size matters. For single-user chat, --batch-size 512 --ubatch-size 512 is usually optimal. Higher batch sizes increase prompt processing speed but use more memory.
Consider MLX. For Apple Silicon, MLX can sometimes outperform llama.cpp on prompt processing. Models like Qwen3.5 have MLX quantizations available. But llama.cpp is more universally supported and has better community quant coverage.

Final Thoughts

96GB unified memory is a remarkable amount of space, but it's not unlimited. The temptation to cram the biggest model possible at the lowest quantization is strong — resist it. A 284B model at Q2 will feel dumber than a 109B model at Q4, even though the total param count is higher. Quantization quality matters as much as parameter count.

REAP is a genuinely promising compression technique, and the Cerebras collection is worth watching. But for immediate use, the community imatrix GGUFs from Bartowski and Unsloth are more practical — they're battle-tested, available now, and the quantizations are optimized for real-world use.

Llama 4 Scout Q4_K_M is the king of 96GB. 17B active, 10M context, fits with room to spare. If you want one model that beats what you're paying for on the API, this is it.

References

• Cerebras REAP paper: arxiv.org/abs/2510.13999

• Cerebras REAP collection: huggingface.co/collections/cerebras/cerebras-reap

• Cerebras REAP repo: github.com/CerebrasResearch/reap

• Bartowski GGUFs: huggingface.co/bartowski

• Unsloth Qwen3.5 GGUFs: huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF

• DeepSeek V4 VRAM estimates: knightli.com

• Llama 4 Scout GGUF: bartowski/Llama-4-Scout-GGUF

• Qwen3.5-122B GGUF: bartowski/Qwen3.5-122B-A10B-GGUF

• llama.cpp DS4 Flash tracking: github.com/ggml-org/llama.cpp/issues/22319