SLMs vs LLMs: Sizing Models in a Fast-Moving Landscape
New models ship weekly, and MoE (Mixture of Experts) makes raw parameter counts misleading. A visual guide to navigating model sizes — and knowing when to go small, go big, or route between both.
Feb 21, 2026
Size Taxonomy
The Model Landscape
Where models actually sit on the parameter scale — and why MoE makes it complicated.
MoE models store hundreds of billions of parameters but activate only a small fraction per token — that active fraction is all that's actually used during inference.
Going small:
✓ Significantly cheaper at scale
✓ Own your weights & data
✓ Runs on-device / air-gapped
× Requires SFT data & training pipeline
× More ops complexity vs. a third-party API
↓
What about MoE — is that big or small?
↓
Mixture of Experts
The MoE Complication
MoE models have up to 1T total parameters but only activate 17–37B per token. So are they big or small?
In a dense model, every parameter fires on every token. MoE splits the network into dozens of specialist sub-networks (“experts”) and uses a lightweight router to pick a small handful for each token. The result: frontier-level quality at a fraction of the compute, because most of the model stays asleep.
Routing a single token through the pipeline:
Token In (input embedding)
↓
Attention (shared layers)
↓
Router (selects experts)
↓
Selected experts (the rest stay idle)
↓
Output (shared projection)
↓
Token Out (next token)
1T total parameters (Kimi K2) · 32B active per token · 3.2% utilization
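The pipeline above can be sketched in a few lines of NumPy. This is a toy top-2-of-8 configuration with hypothetical sizes (not Kimi K2's real architecture), and each "expert" is reduced to a single weight matrix for clarity:

```python
# Minimal sketch of top-k MoE routing: a router scores experts per token,
# and only the top-k selected experts actually compute.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Router: a learned linear layer scoring each expert for this token.
W_router = rng.normal(size=(d_model, n_experts))
# Experts: tiny stand-in feed-forward weights, one matrix each.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Route one token vector through its top-k experts."""
    logits = x @ W_router                      # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]          # indices of selected experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over selected only
    # Only the chosen experts run; the rest stay "asleep".
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out, top

token = rng.normal(size=d_model)
out, active = moe_layer(token)
print("active experts:", sorted(active.tolist()))
print(f"parameter utilization: {top_k / n_experts:.1%}")  # 25.0%
```

The same logic at Kimi K2's scale (32B active of 1T total) gives the 3.2% utilization quoted above — the router is what lets total and active parameter counts diverge so sharply.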
↓
So how do you actually decide?
↓
Decision Framework
Match the Signal to the Size
Nine questions that tell you whether to go big or small, with the reasoning for each.
Is the task well-defined?
Go Small›
Narrow, well-scoped tasks (classification, extraction, routing) are SLM territory. Fine-tuning on your specific labels consistently beats zero-shot LLMs.
Do you have labeled training data?
Go Small›
Even 500–1K labeled examples can push a 1–3B model past a 400B generalist. No data means you need the LLM's zero-shot ability.
Is query volume high?
Go Small›
At 10K+ queries/day, the cost gap becomes existential: $0.01/1K tokens for a fine-tuned SLM vs $1.00/1K for an LLM API compounds fast.
Do you need real-time latency?
Go Small›
SLMs generate 150–300 tok/s vs 50–100 tok/s for LLMs. For user-facing applications, that latency gap defines the entire UX.
Is the deployment target edge or mobile?
Go Small›
Models under 4B parameters can run on-device with quantization. No network round-trip, no cloud dependency, works fully offline.
Is budget constrained?
Go Small›
Fine-tuning a 3B model costs ~$50–200. Running it costs 100x less than an LLM API. The ROI is immediate for well-defined tasks.
Does data need to stay on-premise?
Go Small›
Self-hosted SLMs mean no data leaves your infrastructure. For healthcare, finance, and legal this is often a hard requirement, not a preference.
Does the system handle diverse, unpredictable tasks?
Go Big›
If you can't enumerate the task space — customer service bots, coding assistants, research tools — you need the LLM's broad generalization ability.
Does the task require multi-step reasoning?
Go Big›
Chain-of-thought, planning, and complex inference still favor large models. SLMs struggle with problems requiring 3+ reasoning steps or compositional logic.
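The nine signals above can be tallied mechanically. This is a hypothetical scorer, not a validated methodology — the signal names and the "mixed signals means route" rule are illustrative assumptions:

```python
# Toy tally of the nine sizing signals: answer True/False to each question.
SMALL_SIGNALS = [
    "task_well_defined",
    "has_labeled_data",
    "high_query_volume",
    "needs_realtime_latency",
    "edge_or_mobile_target",
    "budget_constrained",
    "data_must_stay_onprem",
]
BIG_SIGNALS = ["diverse_unpredictable_tasks", "needs_multistep_reasoning"]

def size_recommendation(answers: dict) -> str:
    small = sum(answers.get(s, False) for s in SMALL_SIGNALS)
    big = sum(answers.get(s, False) for s in BIG_SIGNALS)
    if big and small:
        # Mixed signals: split traffic rather than pick one model.
        return f"route ({small} small / {big} big)"
    return "go small" if small >= big else "go big"

print(size_recommendation({"task_well_defined": True, "has_labeled_data": True}))
# -> go small
print(size_recommendation({"needs_multistep_reasoning": True}))
# -> go big
```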
↓
In practice, you don't pick one — you route between both.
↓
Routing Playbook
Route, Don't Choose
A small router model classifies each request, sending simple queries to the SLM and hard ones to the LLM.
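A minimal sketch of that router. Here a keyword-and-length heuristic stands in for the small classifier model; in production you would replace it with a fine-tuned classifier, and the `"slm"`/`"llm"` labels are placeholders for your actual endpoints:

```python
# Toy complexity router: simple queries go to the SLM, hard ones to the LLM.
def route(query: str) -> str:
    # Stand-in heuristic for a small classifier model.
    complex_markers = ("why", "explain", "compare", "plan", "step by step")
    is_complex = (
        len(query.split()) > 40
        or any(m in query.lower() for m in complex_markers)
    )
    return "llm" if is_complex else "slm"

print(route("What is the order status for #1234?"))                 # -> slm
print(route("Explain the tradeoffs between MoE and dense models"))  # -> llm
```

The router itself must be cheap and fast — if classifying a request costs as much as answering it with the SLM, the economics collapse.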
Complex query ratio: 20%
SLM Path
Fast, cheap — handles ~80% of traffic
LLM Path
Powerful — handles ~20% of complex queries
Blended cost with routing: ~$0.21 per 1K tokens (80% SLM at $0.01 + 20% LLM at $1.00)
All-LLM baseline: $1.00 per 1K tokens
100× SLM cost advantage: $0.01 / 1K tokens (SLM) vs $1.00 / 1K tokens (LLM)
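The routing economics are simple arithmetic on the figures above. The query volume and tokens-per-query here are assumed for illustration:

```python
# Back-of-envelope routing economics: $0.01/1K tokens (SLM),
# $1.00/1K tokens (LLM), 20% of queries routed to the LLM.
SLM_COST, LLM_COST = 0.01, 1.00          # dollars per 1K tokens
queries_per_day = 10_000                 # assumed volume
tokens_per_query = 1_000                 # assumed average
complex_ratio = 0.20

per_query_kilotokens = tokens_per_query / 1_000
routed = queries_per_day * per_query_kilotokens * (
    (1 - complex_ratio) * SLM_COST + complex_ratio * LLM_COST
)
baseline = queries_per_day * per_query_kilotokens * LLM_COST

print(f"routed:   ${routed:,.2f}/day")           # $2,080.00/day
print(f"baseline: ${baseline:,.2f}/day")         # $10,000.00/day
print(f"savings:  {1 - routed / baseline:.0%}")  # 79%
```

Note where the money goes: even at a 20% complex ratio, the LLM path accounts for nearly all of the routed spend, so shaving the complex ratio further is where tuning the router pays off.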
↓
The bottom line
↓
Key Takeaways
What to Remember
A fine-tuned 3B model can beat a zero-shot 400B model on a well-scoped task — the benchmarks are consistent on this.
MoE makes “how big is this model?” a harder question. Always check active parameters, not just total.
LLMs still win on open-ended reasoning, long context, and tasks you can’t define upfront. Don’t force-fit an SLM where the task is ambiguous.
Routing is a practical middle ground: classify requests by complexity, send the easy ones to a small model, and save the expensive model for what actually needs it.
Prototype with an LLM API first. Once the task stabilizes and you have labeled data, distill down to an SLM you own.
Last updated Feb 21, 2026. Models change fast — the decision framework doesn’t.