Running Frontier LLMs on a 96GB Mac Studio:
REAP, IMatrix, and the Models That Beat DeepSeek V4 Flash

May 25, 2026 · ~15 min read · local-llm mac-studio gguf quantization

The 96GB Question

The M3 Ultra Mac Studio with 96GB of unified memory is a beast. 400 GB/s of bandwidth, a massive shared pool that CPU and GPU both access at full speed. But when you sit down to run a serious LLM locally, you quickly hit a wall: what models actually fit, and which ones are better than what you're paying for on the API?

If you're coming from DeepSeek V4 Flash — 284B total parameters, 13B active, served at FP8 quality — you know the bar is high. Reaching or exceeding that quality from a single Mac Studio requires careful model selection. This guide covers everything I found: REAP-pruned MoE models from Cerebras Research, IMatrix GGUF quantizations from the community, and the concrete math of what fits.

TL;DR

Llama 4 Scout Q4_K_M (67.55 GB, 17B active params) is the single best model for 96GB. 17B active beats DS4 Flash's 13B, it has 10M native context, and it leaves 28GB for KV cache.

Qwen3.5-122B-A10B Q4_K_M (77.62 GB, 10B active) is the runner-up — slightly fewer active params than DS4 Flash, but Qwen3.5's architecture is excellent and 122B total params means deep knowledge.

GLM-4.5-Air-REAP-82B-A12B Q4 (~41 GB, 12B active) is the REAP experiment worth trying — pruned from GLM-4.5, near-lossless, lots of room for context.

Links to download each model are at the bottom.

Understanding the Constraints

Before we talk about models, you need to understand what 96GB actually buys you. Unified memory on Apple Silicon is both system RAM and GPU VRAM — no PCIe transfers, no separate pool. That's its superpower. But the math is unforgiving:

QuantizationBytes per param30B model100B model284B model (DS4 Flash)
FP8 / Q8_01 byte30 GB100 GB284 GB ❌
Q6_K~0.75 bytes22.5 GB75 GB213 GB ❌
Q5_K_M~0.63 bytes18.9 GB63 GB179 GB ❌
Q4_K_M~0.5 bytes15 GB ✅50 GB ✅142 GB ❌
Q3_K_M~0.38 bytes11.4 GB38 GB108 GB ❌
Q2_K / IQ2_M~0.25 bytes7.5 GB25 GB71 GB ✅

The key insight: A 284B MoE at Q4 needs ~142 GB — 48% more than 96GB provides. Even at Q3 (~108 GB), it's over the limit. Only at Q2 (~71 GB) does DS4 Flash fit, but Q2 on a model this big is... not great. The quality cliff for Q2 on MoE architectures is real — you lose the nuance that makes the model worth running.

So the strategy isn't "cram DS4 Flash into 96GB at Q2." The strategy is find models with better active-param-per-GB ratios that fit at Q4 or Q5, where quality is preserved.

What "Better Than DS4 Flash" Actually Means

DeepSeek V4 Flash has 284B total parameters but only 13B active per token. MoE models work by routing each token through a subset of experts. The active-param count determines per-token compute capacity — how much thinking happens per word. The total param count determines parametric knowledge — how much is memorized in weights.

For most tasks, active params matter more than total params. A model with 17B active and 109B total (Llama 4 Scout) can be more capable per token than one with 13B active and 284B total (DS4 Flash), especially when both are at Q4. The extra total params in DS4 Flash represent knowledge breadth, but the active-param ceiling limits per-token reasoning.

So the bar is: ≥13B active params, ≥80B total params, Q4 or better quantization, fits in 96GB with room for context.

REAP: Router-weighted Expert Activation Pruning

Cerebras Research published REAP (accepted to ICLR 2026) — a method for compressing MoE models by pruning the least-used experts. Rather than merging experts (which causes "functional subspace collapse"), REAP identifies and removes experts that contribute least to the output, then fine-tunes the remaining experts and router to recover quality.

The results are impressive: near-lossless compression at 50% expert removal on Qwen3-Coder-480B and Kimi-K2. For our purposes, REAP creates models with fewer total parameters but the same active-param count — exactly what we need for fitting into 96GB.

Cerebras released a full collection on HuggingFace: cerebras/cerebras-reap.

REAP Models for 96GB

ModelTotalActiveQ4 SizeFits 96GB?Notes
GLM-4.5-Air-REAP-82B-A12B 82B 12B ~41 GB ✅ Lots of room Pruned from GLM-4.5-Air. Best REAP fit.
GLM-4.5-Air-REAP-82B-A12B-FP8 82B 12B ~82 GB ✅ Tight FP8 version, near-lossless. Less room for ctx.
GLM-4.6-REAP-218B-A32B-FP8 218B 32B ~109 GB (Q4) ❌ Q2 only 32B active is amazing, but doesn't fit at good quants.
GLM-4.6-REAP-252B-A32B-FP8 252B 32B ~126 GB (Q4) Even larger variant.
Qwen3-Coder-REAP-25B-A3B 25B 3B ~13 GB ✅ Tiny Code-specific. Too small to compete with DS4 Flash.
Qwen3-Coder-REAP-246B-A35B 246B 35B ~123 GB (Q4) Code-specific. Doesn't fit at usable quants.
Qwen3-Coder-REAP-363B-A35B 363B 35B ~182 GB (Q4) Even larger.

GLM-4.5-Air-REAP-82B-A12B is the only REAP model that clears the bar: 12B active (near DS4 Flash's 13B), fits at Q4 with 55GB leftover for context cache. The FP8 variant is tighter but gives you lossless weight precision if you're willing to sacrifice context length.

The bigger REAP models (GLM-4.6-REAP with 32B active) are tantalizing — 32B active would handily beat DS4 Flash — but they only fit at Q2 or IQ2, which defeats the purpose.

IMatrix GGUF Quantizations: The Community Standard

While REAP is a pruning technique (reducing the model itself), IMatrix is a quantization calibration method that improves low-bit quality. Created by the llama.cpp community, an imatrix calibration file collects activation statistics from a representative dataset and uses them to allocate bits more intelligently across layers during quantization.

The two main publishers of imatrix GGUFs are Bartowski and Unsloth. Unsloth's Dynamic GGUF quants (the "UD-" prefix) use an improved imatrix calibration dataset and consistently achieve better KL-divergence at the same file size compared to standard K-quants.

Available IMatrix Models That Fit 96GB

ModelTotalActiveBest QuantFile SizeContextSource
Llama 4 Scout 109B 17B Q4_K_M 67.55 GB 10M tokens Bartowski
Qwen3.5-122B-A10B 122B 10B Q4_K_M 77.62 GB 256K tokens Bartowski / Unsloth
GLM-4.5-Air-REAP-82B-A12B 82B 12B Q4_K_M ~41 GB 256K tokens Cerebras (official)
Qwen3.5-35B-A3B 35B 3B Q4_K_M ~22 GB 256K tokens Bartowski / Unsloth
DeepSeek V4 Flash 284B 13B Q3 ~60 GB Community (experimental)
DeepSeek V4 Flash 284B 13B Q4 ~80 GB Community (experimental)

Deep Dive: Llama 4 Scout

Meta's Llama 4 Scout is 109B total with 17B active params — 16 experts, with 2 active per token. At Q4_K_M (67.55 GB), it fits comfortably in 96GB with 28GB left for KV cache. At 17B active, it beats DS4 Flash's 13B by ~30%. The 10M token context window is architectural, not a gimmick — it uses interleaved grouped-query attention with RoPE frequency scaling.

On benchmarks, Llama 4 Scout competes with GPT-4o class models. The 17B active params give it serious per-token reasoning capacity. For local inference on a Mac Studio, expect ~15-25 tok/s depending on quantization and context length — plenty fast for interactive use.

Download: huggingface-cli download bartowski/meta-llama_Llama-4-Scout-17B-16E-Instruct-GGUF --include "Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf" --local-dir ./models

Deep Dive: Qwen3.5-122B-A10B

Qwen3.5-122B-A10B is the sweet spot. 122B total, 10B active, and Qwen's architecture is exceptionally well-optimized. At Q4_K_M (77.62 GB), it fits with 18GB for context — enough for 128K+ token windows. 10B active is slightly less than DS4 Flash's 13B, but Qwen3.5's training quality is top-tier — many benchmark comparisons put it at or above Claude Sonnet 4.5 level for coding and reasoning.

The 122B total means it has a massive knowledge base. For retrieval-heavy tasks, factual questions, or long-context reasoning, the extra total params matter. The tradeoff vs Llama 4 Scout is: more total knowledge (122B vs 109B) but fewer active params per token (10B vs 17B).

Download: huggingface-cli download bartowski/Qwen_Qwen3.5-122B-A10B-GGUF --include "Qwen3.5-122B-A10B-Q4_K_M.gguf" --local-dir ./models

Deep Dive: GLM-4.5-Air-REAP-82B-A12B

This is the REAP experiment worth running. GLM-4.5-Air is ZhiPu's strong 20B-A3B MoE, and the REAP-pruned version compresses it to 82B total while keeping 12B active. At Q4_K_M (~41 GB), it leaves a massive 55GB for context — you can run 1M+ token windows without breaking a sweat.

The REAP paper showed that GLM-4.5-Air-REAP-82B achieves near-lossless performance compared to the unpruned GLM-4.5-Air-148B. So you're getting essentially the same quality in half the memory. 12B active is close enough to DS4 Flash's 13B that quality differences will be marginal.

Download: huggingface-cli download cerebras/GLM-4.5-Air-REAP-82B-A12B --local-dir ./models

Note: the REAP models from Cerebras are not in GGUF format — they're standard HF checkpoints. You'll need to convert to GGUF using llama.cpp's convert script, or run them via llama.cpp with the HF loader directly.

What About DeepSeek V4 Flash Locally?

Can you run DS4 Flash on 96GB? Technically yes, at Q2 quantization (~40-71 GB depending on method). But I wouldn't recommend it. Q2 on a 284B MoE is a significant quality regression. The model was trained for FP8 inference — the quantized version loses the nuance that makes it competitive. You'll get worse results than running a natively smaller model at Q4.

There are experimental GGUF builds for DS4 Flash in the community (llama.cpp issue #22319), but support is early. Inference speed on Mac Studio's 400 GB/s bandwidth will be constrained by the massive weight size even at low quants — expect 5-10 tok/s.

Verdict: Wait for Q3 or Q4 community quants to mature, or accept Q2 quality. For now, Llama 4 Scout or Qwen3.5-122B are better options.

Performance Expectations on Mac Studio (M3 Ultra)

The M3 Ultra has 400 GB/s memory bandwidth. For MoE models, throughput is primarily bandwidth-bound, not compute-bound. Here's what you can expect:

ModelQuantWeight SizeEstimated Tok/sMax Context (with headroom)
Llama 4 ScoutQ4_K_M67.55 GB~18-25 tok/s~256K tokens
Qwen3.5-122B-A10BQ4_K_M77.62 GB~15-20 tok/s~128K tokens
GLM-4.5-Air-REAP-82B-A12BQ4_K_M~41 GB~25-35 tok/s~512K tokens
Qwen3.5-35B-A3BQ4_K_M~22 GB~40-55 tok/s~1M tokens
DeepSeek V4 FlashQ2~71 GB~5-10 tok/s~64K tokens

These are estimates based on bandwidth calculations. Real-world performance depends on prompt processing (which is compute-heavy), context length, and batching. For single-user interactive use, even 15 tok/s is plenty fast.

How to Actually Run These

All of these models work with llama.cpp and its OpenAI-compatible server. On macOS, the recommended setup is:

# Install llama.cpp via brew
brew install llama.cpp

# Run a server with flash attention and 32K context
llama-server \
  -hf bartowski/Qwen_Qwen3.5-122B-A10B-GGUF:Q4_K_M \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn on \
  --ctx-size 32768 \
  --port 8080

For Ollama users, most of these models have Ollama tags available:

# Qwen3.5-122B-A10B
ollama pull qwen3.5:122b-a10b-q4_K_M

# Llama 4 Scout
ollama pull llama4-scout

# Qwen3.5-35B-A3B
ollama pull qwen3.5:35b-a3b-q4_K_M

For REAP models in HF format, you can convert to GGUF or use llama.cpp's native HF loader:

# Convert REAP model to GGUF
python3 convert-hf-to-gguf.py \
  --ckpt-dir ./GLM-4.5-Air-REAP-82B-A12B \
  --out-file ./GLM-4.5-Air-REAP-82B-A12B-Q4.gguf \
  --quantize Q4_K_M

Putting It All Together

If you have a 96GB Mac Studio arriving Monday (congrats), here's my recommended priority list:

  1. Start with Llama 4 Scout Q4_K_M (67.55 GB, 17B active). This is the single best model for the hardware. Download from Bartowski's HF page.
  2. Also grab Qwen3.5-122B-A10B Q4_K_M (77.62 GB, 10B active). Keep it as a second option for tasks where Qwen's training quality matters more than active-param count. Swap between them based on the task.
  3. Experiment with GLM-4.5-Air-REAP-82B-A12B if you want to see what REAP pruning can do. The 12B active and 55GB of spare memory make it ideal for long-context experiments.
  4. Skip DeepSeek V4 Flash locally for now. Wait for proper Q3/Q4 community quants or accept API usage for that model.

Quick Reference: Model Downloads

ModelHF PathFile to Download
Llama 4 Scout Q4_K_M bartowski/meta-llama_Llama-4-Scout-17B-16E-Instruct-GGUF Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf
Qwen3.5-122B-A10B Q4_K_M bartowski/Qwen_Qwen3.5-122B-A10B-GGUF Qwen3.5-122B-A10B-Q4_K_M.gguf
Qwen3.5-35B-A3B Q4_K_M unsloth/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q4_K_M.gguf
GLM-4.5-Air-REAP-82B-A12B cerebras/GLM-4.5-Air-REAP-82B-A12B HF checkpoint (convert to GGUF)

Beyond Models: Optimizing Inference

A few tips to maximize your Mac Studio's potential:

Final Thoughts

96GB unified memory is a remarkable amount of space, but it's not unlimited. The temptation to cram the biggest model possible at the lowest quantization is strong — resist it. A 284B model at Q2 will feel dumber than a 109B model at Q4, even though the total param count is higher. Quantization quality matters as much as parameter count.

REAP is a genuinely promising compression technique, and the Cerebras collection is worth watching. But for immediate use, the community imatrix GGUFs from Bartowski and Unsloth are more practical — they're battle-tested, available now, and the quantizations are optimized for real-world use.

Llama 4 Scout Q4_K_M is the king of 96GB. 17B active, 10M context, fits with room to spare. If you want one model that beats what you're paying for on the API, this is it.

References

• Cerebras REAP paper: arxiv.org/abs/2510.13999

• Cerebras REAP collection: huggingface.co/collections/cerebras/cerebras-reap

• Cerebras REAP repo: github.com/CerebrasResearch/reap

• Bartowski GGUFs: huggingface.co/bartowski

• Unsloth Qwen3.5 GGUFs: huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF

• DeepSeek V4 VRAM estimates: knightli.com

• Llama 4 Scout GGUF: bartowski/Llama-4-Scout-GGUF

• Qwen3.5-122B GGUF: bartowski/Qwen3.5-122B-A10B-GGUF

• llama.cpp DS4 Flash tracking: github.com/ggml-org/llama.cpp/issues/22319