Methodology · v0.1
How Mintok turns a workload into a cost.
Every chip count, every $/M-token, every recommendation in a Mintok brief comes from formulas you can read on this page. No proprietary "AI optimization" black box, no tunable fudge factors that vary by customer. Just physics — silicon FLOPS, HBM bandwidth, electricity prices — and clearly-named constants we publish here and don't change behind your back.
If you find a formula that doesn't match the code, that's a bug. Tell us and we'll fix it.
0. Overview & principles
Mintok takes a Workload Spec (how many tokens, how often, how fast) and produces three artifacts: a Sizing recommendation (how many chips, of which kind), a Cost projection ($/M-token across relevant scenarios), and an Engagement Brief that synthesizes both into a customer-readable doc.
Each artifact has a clear formula. The full chain is:
Workload Spec ──▶ Three-constraint solver ──▶ Recommended chip count (N)
                             │                            │
                             ▼                            ▼
                 Binding constraint label      Throughput (tokens/sec)
                             │                            │
                             └─────────────┐   ┌──────────┘
                                           ▼   ▼
                             Cost model (CapEx + Power + OpEx)
                                             │
                                             ▼
                                         $/M-token
Five transparency commitments
- No hidden formulas. Every number in your brief traces back to an equation on this page.
- No per-customer fudge factors. The same constants apply to every engagement. If we change one, we update this doc and bump the version.
- Physics first, heuristics labeled. Where we use a rule of thumb (e.g., 30% server overhead), we name it and explain why.
- Sensitivity over point estimates. Where an input is genuinely uncertain (agent-calls-per-event), we render a range.
- Versioned. This is v0.1. When the methodology changes, the version on your brief stays pinned to the version that produced it. You can always cite exactly which methodology saw your workload.
1. The Workload Spec
The Spec captures what your workload does, separate from how it's implemented. Each field is a physical quantity with a unit. We don't store anyone's interpretation; we store the numbers.
Templates pre-fill defaults per archetype (chatbot, RAG, etc.) so you start from a sensible baseline. Every field is overridable; the brief records both the source (template default vs explicit FDE input) and which numbers the customer reviewed.
Use-case shape
| field | unit | what it captures | flows to |
|---|---|---|---|
| archetype | — | Discrete category (CHATBOT, RAG, AGENTIC_TOOL_USE, BATCH_INFERENCE, CODE_GENERATION, DOCUMENT_ANALYSIS, CUSTOM). Selects a template carrying default Tier-3 priors so the FDE doesn't start blank. | Spec defaults · brief narrative |
| qualityTier | — | production_external (paying customers see it), production_internal (employees see it), or poc. Used to recommend model size and SLA assumptions. | Model recommendation · narrative |
| latencyMode | — | interactive_streaming, interactive_nonstreaming, or batch. Streams have strict per-token deadlines; batch has none. | p95LatencyMs target · sizing margin |
Volume
| field | unit | what it captures | flows to |
|---|---|---|---|
| eventsPerDayMean | events/day | Average daily user interactions or jobs. An "event" is one user-visible request — a chat turn, a document analysis, an agent task. | Annual tokens · QPS derivation |
| eventsPerDayPeakQps | events/s | Peak burst rate the cluster must serve without queueing. Drives compute headroom; for chatty workloads, peak/mean is often 3–10×. | Compute-constraint chip count (Sizing tab) |
Tokens per event
| field | unit | what it captures | flows to |
|---|---|---|---|
| tokensInMean | tokens | Average input length per event — user prompt + retrieved context + system prompt. Determines prefill cost. | KV-cache size · throughput math |
| tokensInP95 | tokens | p95 input length. Important because KV-cache is allocated for the full prompt, not the mean. Sizing uses p95 to avoid OOM under tail load. | Memory-constraint chip count |
| tokensOutMean | tokens | Average generated tokens per event. Decode is autoregressive — each output token is a forward pass. | Total tokens/sec target · annual tokens |
| tokensOutP95 | tokens | p95 output length. Tail-bounds latency: a slow-generating long response can starve other concurrent users. | Latency-mode margin |
Context window
| field | unit | what it captures | flows to |
|---|---|---|---|
| contextTypical | tokens | Typical total context (input + accumulated history). Sets the per-sequence KV-cache footprint at steady-state. | KV-cache size (S in formulas below) |
| contextMax | tokens | Worst-case context the system must support. Sizing must hold this in HBM concurrently with the model weights. | Memory-constraint chip count |
Latency
| field | unit | what it captures | flows to |
|---|---|---|---|
| p95LatencyMs | ms | Target end-to-end p95 latency for streaming workloads (time-to-first-token + total-output-time). Looser targets allow more batching and lower cost-per-token. | Batch-size feasibility · MFU achievable |
Agent calls (the unknown unknown)
| field | unit | what it captures | flows to |
|---|---|---|---|
| agentCallsMin | calls/event | Low-bound estimate of LLM calls per business event in an agentic workflow (chain of thought, tool use, retries). Cost projections are rendered as a SENSITIVITY range, not a point estimate, because this value is the largest scoping uncertainty in most engagements. | Cost lower bound · sensitivity tables |
| agentCallsMax | calls/event | High-bound estimate. The gap between min and max drives the cost-range width customers see in the brief. | Cost upper bound · sensitivity tables |
Caching & growth
| field | unit | what it captures | flows to |
|---|---|---|---|
| cacheHitPct | % | Fraction of input tokens that hit the prefix/KV cache (prompt caching, retrieval-augmented prompts with stable headers). Each percentage point materially reduces prefill compute. | Effective tokens-in for cost |
| growth3MoPct | % | Expected volume growth at 3 months relative to launch (e.g., 50 = +50%). Compounds with later windows. | Sizing horizon · TCO |
| growth12MoPct | % | Growth at 12 months. Typical scoping deliberately sizes for the 12-month volume to avoid mid-year re-procurement. | Annual tokens projection |
| growth36MoPct | % | Growth at 36 months. Sets the lease/CapEx horizon discussion. | TCO horizon decision |
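As a rough illustration of how the caching and growth fields could feed downstream numbers, here is a minimal Python sketch. The function names are hypothetical, and both formulas are assumptions rather than the published method: cached input tokens are treated as costing no prefill, and a growth percentage is read as relative to launch volume (the stated meaning of growth3MoPct). How later windows compound is not modeled here.

```python
def effective_tokens_in(tokens_in_mean: float, cache_hit_pct: float) -> float:
    """Input tokens that still need prefill after cache hits (assumed: hits are free)."""
    if not 0 <= cache_hit_pct <= 100:
        raise ValueError("cache_hit_pct must be between 0 and 100")
    return tokens_in_mean * (1 - cache_hit_pct / 100)


def volume_after_growth(launch_events_per_day: float, growth_pct: float) -> float:
    """Daily volume after growth, with growth_pct relative to launch (50 means +50%)."""
    return launch_events_per_day * (1 + growth_pct / 100)


# Hypothetical workload: 2,000 input tokens per event, 25% cache hits,
# 100,000 events/day at launch, +50% growth at 3 months.
print(effective_tokens_in(2_000, 25))      # 1500.0
print(volume_after_growth(100_000, 50))    # 150000.0
```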
2. The Sizing algorithm
Sizing answers "how many chips of type X do I need to serve this workload?" We use a three-constraint solver: the recommended chip count is the maximum of three minimum chip counts, one per independent physical limit. Whichever is largest binds, and that label is surfaced on the Sizing tab as "Binding: Compute / Bandwidth / Memory" so you know which constraint to relax.
Notation used in §2. Symbols are kept consistent through every formula.
P      = active model parameters (count, not billions)   ▸ from ModelProfile.activeB × 1e9
bpp    = bytes per parameter                             ▸ 2 for BF16, 1 for FP8/INT8, 0.5 for FP4
T_tgt  = target throughput (tokens/sec)                  ▸ derived from Spec eventsPerDay + tokensOut
peakF  = chip peak FLOPS                                 ▸ from ChipProfile.bf16Tflops × 1e12
mfu    = Model FLOPs Utilization (fraction, 0–1)         ▸ default 0.40 (see §4)
mbu    = Memory Bandwidth Utilization (fraction, 0–1)    ▸ default 0.50 (see §2.2)
hbmBw  = chip HBM bandwidth (bytes/sec)                  ▸ from ChipProfile.hbmBwGbps × 1e9
hbmCap = chip HBM capacity (bytes)                       ▸ from ChipProfile.hbmGb × 1e9
B      = serving batch size                              ▸ from Spec or model default
S      = context length (tokens)                         ▸ from Spec contextMax
L      = model layers                                    ▸ from ModelProfile or estimated
D      = hidden dimension                                ▸ from ModelProfile or estimated
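The notation says T_tgt is derived from the Spec's event rate and output length without spelling out the arithmetic. The sketch below shows one plausible reading in Python. It is an assumption, not the published derivation: it sizes for the peak rate (eventsPerDayPeakQps) times mean output tokens, with an optional hypothetical calls_per_event multiplier for agentic workloads. All function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def mean_qps(events_per_day_mean: float) -> float:
    """Average event rate implied by the daily volume."""
    return events_per_day_mean / SECONDS_PER_DAY


def target_tokens_per_sec(peak_qps: float,
                          tokens_out_mean: float,
                          calls_per_event: float = 1.0) -> float:
    """Assumed T_tgt: output tokens/sec the cluster must sustain at peak load."""
    return peak_qps * calls_per_event * tokens_out_mean


# Hypothetical workload: 864,000 events/day, peak of 20 events/s, 300 output tokens/event.
print(mean_qps(864_000))                 # 10.0 events/s on average
print(target_tokens_per_sec(20, 300))    # 6000.0 tokens/s at peak
```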
2.1 Compute constraint
Each generated token costs roughly 2P FLOPs in the forward pass (one multiply and one add per parameter). To deliver T_tgt tokens/sec, the cluster needs at minimum:
chipsCompute = ⌈ (2 × T_tgt × P) / (peakF × mfu) ⌉

physical reading:
  numerator   = total FLOPS/sec the workload demands
  denominator = effective FLOPS/sec each chip actually delivers under load
The 2 is the well-established forward-pass FLOPs-per-parameter constant (one multiply + one add per weight, per token). The × mfu acknowledges that real workloads never hit peak silicon spec; published benchmarks for H100 BF16 inference land at 35–55% depending on batch and sequence, and we default to 0.40 for conservative sizing. (Override per engagement when you have measured numbers.)
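The compute constraint can be sketched in a few lines of Python. This is an illustrative transcription of the formula above, not Mintok's code; the function name and the example chip and model figures are hypothetical.

```python
import math


def chips_compute(t_tgt: float, params: float, peak_flops: float, mfu: float = 0.40) -> int:
    """Minimum chips so that effective FLOPS covers 2 * T_tgt * P."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    demanded = 2 * t_tgt * params      # FLOPs per second the workload needs
    delivered = peak_flops * mfu       # FLOPs per second one chip sustains
    return math.ceil(demanded / delivered)


# Hypothetical inputs: 10,000 tokens/s, 70e9 active parameters, 1e15 peak FLOPS per chip.
print(chips_compute(10_000, 70e9, 1e15))             # 4  (3.5 rounded up)
print(chips_compute(10_000, 70e9, 1e15, mfu=0.55))   # 3
```

Raising mfu from 0.40 to 0.55 in this example drops the count from 4 to 3, which is why measured utilization is worth substituting when it is available.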
2.2 Bandwidth constraint
Decode is memory-bandwidth bound: every decode step streams the model's weights from HBM into the compute units. With batch size B, each weight read serves all B sequences in the batch, so batching amortizes the cost.
chipsBandwidth = ⌈ (T_tgt × P × bpp) / (B × hbmBw × mbu) ⌉

physical reading:
  numerator   = total bytes/sec the workload must read across all streams
  denominator = effective bytes/sec each chip's HBM delivers per batch slot
The × mbu is the bandwidth analogue of mfu on the compute side: Memory Bandwidth Utilization — the fraction of peak HBM bandwidth you actually sustain during decode. You never hit the spec number; real MBU on current accelerators lands roughly 50–65% depending on batch, sequence, and kernel quality. We default to 0.50 for conservative sizing (override per org in Finance settings, or per run). A lower MBU means more chips to hit the same token rate.
This is why short-context interactive workloads with tight latency targets (small batch sizes) often hit the bandwidth wall before the compute wall — and why long-context batch workloads can be compute-bound instead. Quantization (FP8 halves bpp) is one of the few levers that helps both compute and bandwidth simultaneously.
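To make the batching and quantization levers concrete, here is a small Python sketch of the bandwidth constraint. It transcribes the formula in this section; the function name and all example figures (bandwidth, batch size, model size) are hypothetical.

```python
import math


def chips_bandwidth(t_tgt: float, params: float, bpp: float,
                    batch: int, hbm_bw: float, mbu: float = 0.50) -> int:
    """Minimum chips so that sustained HBM bandwidth covers weight streaming."""
    bytes_needed = t_tgt * params * bpp       # bytes/sec if every token re-read the weights
    bytes_served = batch * hbm_bw * mbu       # bytes/sec one chip serves, amortized over the batch
    return math.ceil(bytes_needed / bytes_served)


# Hypothetical inputs: 10,000 tokens/s, 70e9 parameters, 3e12 bytes/s HBM per chip.
base = chips_bandwidth(10_000, 70e9, bpp=2, batch=32, hbm_bw=3e12)
fp8 = chips_bandwidth(10_000, 70e9, bpp=1, batch=32, hbm_bw=3e12)
big_batch = chips_bandwidth(10_000, 70e9, bpp=2, batch=64, hbm_bw=3e12)
print(base, fp8, big_batch)   # 30 15 15
```

In this example, halving bpp and doubling the batch size have the same effect on the chip count, since both halve the bytes each chip must stream per token.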
2.3 Memory constraint
The model's weights must fit in HBM. So must the KV-cache for every active sequence in the batch. The cluster must hold both, concurrently:
modelBytes  = P × bpp                         ▸ weights, in bytes
kvBytes     = 2 × L × D × S × B × bpp         ▸ KV-cache for active batch
chipsMemory = ⌈ (modelBytes + kvBytes) / hbmCap ⌉
The KV-cache formula comes from attention mechanics: each layer caches 2 tensors (K and V), each of size D per token, for S tokens of context, across B concurrent sequences. Long-context (large S) workloads can have a KV-cache that exceeds the model itself; a 200K-token context on a 70B model is comparable to the model's own weight footprint.
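The memory constraint can be worked through in Python as follows. This is a sketch of the formulas above under stated assumptions: the function names are hypothetical, and the layer count, hidden dimension, context, batch size, and HBM capacity in the example are illustrative values, not catalog entries.

```python
import math


def kv_cache_bytes(layers: int, hidden: int, context: int, batch: int, bpp: float) -> float:
    """K and V tensors (the factor 2), per layer, per token, per sequence."""
    return 2 * layers * hidden * context * batch * bpp


def chips_memory(params: float, bpp: float, layers: int, hidden: int,
                 context: int, batch: int, hbm_cap: float) -> int:
    """Minimum chips whose combined HBM holds weights plus KV-cache."""
    model_bytes = params * bpp
    kv_bytes = kv_cache_bytes(layers, hidden, context, batch, bpp)
    return math.ceil((model_bytes + kv_bytes) / hbm_cap)


# Hypothetical inputs: 70e9 parameters in BF16, 80 layers, hidden dim 8192,
# 8,192-token context, batch of 32, 80e9 bytes of HBM per chip.
kv = kv_cache_bytes(80, 8192, 8192, 32, 2)
print(kv)                                                    # 687194767360
print(chips_memory(70e9, 2, 80, 8192, 8192, 32, 80e9))       # 11
```

In this example the KV-cache (about 687 GB) is several times larger than the weights (140 GB), so context length and batch size dominate the memory-derived chip count.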
2.4 Binding constraint
The cluster needs to satisfy all three constraints, so the chip count is the max:
N = max(chipsCompute, chipsBandwidth, chipsMemory, 1)
binding = "Compute"   if chipsCompute is the max
          "Bandwidth" if chipsBandwidth is the max (tie-breaking favors bandwidth over memory)
          "Memory"    otherwise
The binding label appears in the brief and the Sizing tab. It tells you what to relax to lower cost. If compute-bound: a smaller model or quantization. If bandwidth-bound: larger batches or higher-bandwidth silicon. If memory-bound: shorter contexts or model distillation.
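The selection rule above can be sketched in Python like this. The function name is hypothetical. Checking compute first means compute also wins ties, which is an assumption drawn from the order of the rules; the text only states that bandwidth wins ties against memory.

```python
def binding_constraint(chips_compute: int, chips_bandwidth: int, chips_memory: int):
    """Return (N, label) where N is the max of the three counts, floored at 1."""
    n = max(chips_compute, chips_bandwidth, chips_memory, 1)
    if chips_compute == n:
        label = "Compute"
    elif chips_bandwidth == n:
        label = "Bandwidth"     # checked before memory, so bandwidth wins a tie with memory
    else:
        label = "Memory"
    return n, label


print(binding_constraint(10, 4, 8))    # (10, 'Compute')
print(binding_constraint(4, 30, 11))   # (30, 'Bandwidth')
print(binding_constraint(4, 11, 11))   # (11, 'Bandwidth')
```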
After N is set, throughput is back-computed as the minimum of what compute and bandwidth can actually deliver at that count:
throughputCompute   = (N × peakF × mfu) / (2 × P)
throughputBandwidth = (N × hbmBw × mbu × B) / (P × bpp)
throughputActual    = min(throughputCompute, throughputBandwidth)
3. The Cost algorithm
Once the cluster is sized (N chips), we stack three cost streams: hardware, power, and overhead OpEx. The whole thing collapses into one number — $/M-token — so customers can compare like-for-like across chips, models, and providers.
3.1 Hardware cost
Hardware cost has two flavors depending on acquisition mode:
# CapEx mode: amortize the purchase over the depreciation horizon.
totalChipCost  = N × chipUnitPrice
totalRackInfra = ⌈N / 32⌉ × rackInfraCost     ▸ 32 = 8 chips/node × 4 nodes/rack
totalCapEx     = totalChipCost + totalRackInfra
annualCapEx    = totalCapEx / amortYears       ▸ default 4 years

# Lease mode: monthly bill, no amortization.
monthlyHwLease = N × chipLeasePerMonth         ▸ from vendor / colo provider
annualHwLease  = monthlyHwLease × 12
Rack infrastructure cost covers PDU, network switches, cabling, cold-aisle containment — the things you buy alongside the chips. Default is amortized over the same horizon as chips, unless the lease includes it (vendor-dependent).
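Both acquisition modes can be sketched in Python as below. The function names and every price in the example are hypothetical; only the rack size of 32 and the 4-year default come from this page.

```python
import math

CHIPS_PER_RACK = 32   # 8 chips/node × 4 nodes/rack


def annual_capex(n: int, chip_unit_price: float, rack_infra_cost: float,
                 amort_years: float = 4) -> float:
    """Purchase price of chips plus rack infrastructure, spread over the amortization horizon."""
    racks = math.ceil(n / CHIPS_PER_RACK)
    total_capex = n * chip_unit_price + racks * rack_infra_cost
    return total_capex / amort_years


def annual_lease(n: int, chip_lease_per_month: float) -> float:
    """Twelve monthly lease bills, no amortization."""
    return n * chip_lease_per_month * 12


# Hypothetical prices: $25,000 per chip, $100,000 per rack, $1,500 per chip per month.
print(annual_capex(40, 25_000, 100_000))   # 300000.0 (40 chips need 2 racks)
print(annual_lease(40, 1_500))             # 720000
```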
3.2 Power cost
Each chip's nameplate TDP (Thermal Design Power) is its max sustained draw. Datacenter total draw is higher because cooling and conversion losses add overhead — captured by the PUE (Power Usage Effectiveness) multiplier.
clusterPowerKW  = (N × chipTdpW × PUE) / 1000        ▸ PUE default 1.30
annualPowerCost = clusterPowerKW × 8,760 × kwhCost
                  └────────────┘   └───┘   └─────┘
                  draw at the wall hr/yr   $/kWh you pay
kwhCost is engagement-specific: typical US datacenter rates are $0.04–$0.12/kWh; renewable PPAs can go lower and constrained markets higher. We don't assume; the FDE inputs the customer's actual contracted rate (or industry average if unknown), and the brief notes which.
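A short Python sketch of the power calculation follows. The function names are hypothetical, and the chip TDP and electricity rate in the example are illustrative inputs rather than catalog or customer values.

```python
HOURS_PER_YEAR = 8_760


def cluster_power_kw(n: int, chip_tdp_w: float, pue: float = 1.30) -> float:
    """Facility draw in kW: chip nameplate power scaled up by PUE."""
    return n * chip_tdp_w * pue / 1000


def annual_power_cost(n: int, chip_tdp_w: float, kwh_cost: float, pue: float = 1.30) -> float:
    """Energy bill for running the cluster all year at the contracted $/kWh."""
    return cluster_power_kw(n, chip_tdp_w, pue) * HOURS_PER_YEAR * kwh_cost


# Hypothetical inputs: 40 chips at 700 W each, $0.08/kWh.
print(round(cluster_power_kw(40, 700), 2))          # 36.4
print(round(annual_power_cost(40, 700, 0.08), 2))   # 25509.12
```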
3.3 OpEx overhead
Real operations cost more than chips + power. Networking egress, software licenses, on-call engineering, monitoring, backup, and datacenter staff collectively make up the OpEx overhead. We bundle these as a single percentage on top of the hardware-plus-power subtotal:
annualSubtotal     = annualHw + annualPowerCost          ▸ annualHw = annualCapEx or annualHwLease (§3.1)
annualOpExOverhead = annualSubtotal × opexOverheadPct%
annualTotalCost    = annualSubtotal + annualOpExOverhead
Typical opexOverheadPct ranges 15–35% depending on customer maturity (in-house team with existing infra: low; greenfield deployment: high). The FDE sets this per engagement; the brief shows what was chosen.
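The overhead step can be sketched in Python as below. The function name and the dollar figures are hypothetical; the 25% overhead is simply one value inside the 15–35% range quoted above.

```python
def annual_total_cost(annual_hw: float, annual_power: float, opex_overhead_pct: float) -> float:
    """Hardware plus power, with OpEx overhead applied as a percentage of that subtotal."""
    subtotal = annual_hw + annual_power
    overhead = subtotal * opex_overhead_pct / 100
    return subtotal + overhead


# Hypothetical inputs: $300,000/yr hardware, $25,509.12/yr power, 25% overhead.
print(round(annual_total_cost(300_000, 25_509.12, 25), 2))   # 406886.4
```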
3.4 $/M-token derivation
The unifying metric. Take the cluster's sustained throughput (from §2.4), project annual tokens at expected utilization, then divide total annual cost by that token count:
annualTokens = throughputActual × 31,536,000 × utilizationPct%
                                  (sec/year)   └─────────────┘
                                               fraction of time cluster is serving load,
                                               typically 60–80% for production inference
costPer1MToken = annualTotalCost × 1,000,000 / annualTokens
Lower utilization → fewer tokens generated per dollar of fixed cost → higher $/M-token. This is why a lightly-loaded reserved cluster looks expensive vs cloud-on-demand. The Cost tab renders both: CapEx-amortized $/M-token and equivalent cloud-rate $/M-token, so the customer sees the crossover.
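The effect of utilization on $/M-token is easy to see in a small Python sketch. The function name and the cost and throughput figures are hypothetical; the formula is the one given in this section.

```python
SECONDS_PER_YEAR = 31_536_000   # 365 × 24 × 3600


def cost_per_1m_tokens(annual_total_cost: float, throughput_actual: float,
                       utilization_pct: float) -> float:
    """Annual cost divided by annual tokens, scaled to one million tokens."""
    if not 0 < utilization_pct <= 100:
        raise ValueError("utilization_pct must be in (0, 100]")
    annual_tokens = throughput_actual * SECONDS_PER_YEAR * utilization_pct / 100
    return annual_total_cost * 1_000_000 / annual_tokens


# Hypothetical cluster: $400,000/yr total cost, 10,000 tokens/s sustained.
busy = cost_per_1m_tokens(400_000, 10_000, 70)
idle = cost_per_1m_tokens(400_000, 10_000, 35)
print(round(busy, 3), round(idle, 3))   # 1.812 3.624
```

Halving utilization doubles the cost per million tokens in this example, because the annual cost is fixed while the token count halves.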
Sensitivity rendering
Where a Spec field has explicit uncertainty (the agentCallsMin/Max pair), $/M-token in the brief is shown as a band: the calculation runs at both ends and the customer sees the range. Mid-points are arithmetic, not editorial — we don't pick the "likely" value, we show the spread.
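The band rendering might look like the following Python sketch. Everything here is illustrative: `sensitivity_band` and `Band` are hypothetical names, and the cost function passed in stands in for whatever calculation is being run at each end. The only behavior taken from the text is that both ends are evaluated and the midpoint is the plain arithmetic mean.

```python
from typing import Callable, NamedTuple


class Band(NamedTuple):
    low: float
    mid: float
    high: float


def sensitivity_band(cost_fn: Callable[[float], float],
                     agent_calls_min: float, agent_calls_max: float) -> Band:
    """Evaluate the cost at both ends of the agent-call range and report the spread."""
    a, b = cost_fn(agent_calls_min), cost_fn(agent_calls_max)
    low, high = min(a, b), max(a, b)
    return Band(low, (low + high) / 2, high)   # midpoint is arithmetic, not a "likely" pick


# Hypothetical cost model: a fixed 0.5 plus 0.25 per agent call.
def example_cost(calls: float) -> float:
    return 0.5 + 0.25 * calls


print(sensitivity_band(example_cost, 2, 10))   # Band(low=1.0, mid=2.0, high=3.0)
```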
4. Constants reference
Every named constant used above. These don't change per customer; they change only when the methodology version bumps (and you can compare deltas across versions).
| constant | value | rationale |
|---|---|---|
| BYTES_PER_PARAM | 2 | BF16/FP16 weights are 2 bytes per parameter. Quantized (FP8/INT8) is 1 byte, FP4 is 0.5 bytes. Switches based on the precision the model is served at. |
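| MFU (default) | 0.40 | Conservative Model FLOPs Utilization: the fraction of peak FLOPS the model actually achieves under realistic batching. Published H100 BF16 inference benchmarks range 35–55% depending on batch size and sequence length. We default to the lower end so projections don't over-promise. |
| MBU (default) | 0.50 | Conservative Memory Bandwidth Utilization: the fraction of peak HBM bandwidth sustained during decode (see §2.2). Real MBU on current accelerators lands roughly 50–65% depending on batch, sequence, and kernel quality. We default to the lower end. |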
| SERVER_OVERHEAD | 1.30 | 30% headroom on top of the math-derived minimum. Covers OS overhead, networking, framework inefficiency, prefill/decode imbalance. Standard rule of thumb in datacenter sizing. |
| INFERENCE_MEM_MULT | 1.20 | 20% headroom on model weights memory to allow for activation tensors during forward pass. |
| PUE | 1.30 | Power Usage Effectiveness. Total facility power ÷ IT load. 1.3 is a representative modern datacenter (cooling + losses add 30% to chip TDP). Override per engagement if the customer has a measured value. |
| CHIPS_PER_NODE | 8 | Industry standard: NVIDIA HGX-H100 = 8 GPUs/node. TPU pods, AMD MI300 systems also normalize at 8/node. |
| NODES_PER_RACK | 4 | 8 chips × 4 nodes = 32 chips/rack. Matches typical 4U HGX server form factor at 42U rack density with power + cooling budget. |
| HOURS_PER_YEAR | 8,760 | 365 × 24. Used for annual energy cost. |
| SECONDS_PER_MONTH | 2,592,000 | 30 days × 86,400 s/day. Standardizes monthly throughput-to-tokens conversion. |
| AMORT (default) | 4 years | Standard accelerator depreciation horizon. Customers running CapEx scenarios can override (3–5 yr typical). |
Catalog values (chip TFLOPS, HBM GB, HBM bandwidth, TDP, list price, model active params, model layers, hidden dim) are not constants — they live in Mintok's chip and model catalog and update as vendors publish new specs. The brief always cites the catalog version it ran against.
5. Glossary
- MFU (Model FLOPs Utilization): The fraction of peak silicon FLOPS actually consumed by useful model computation. A 200-TFLOPS chip running at 0.40 MFU delivers 80 effective TFLOPS for the model.
- MBU (Memory Bandwidth Utilization): The analogous metric for HBM: the fraction of peak bandwidth actually sustained. Decode is bandwidth-bound; prefill is compute-bound. Our sizing uses the binding-constraint approach to handle both.
- KV-cache: Key-value tensors cached per layer per token for the attention mechanism. Size grows linearly with context length and batch size. Often the memory-binding constraint for long-context workloads.
- Binding constraint: The physical limit that determines minimum chip count. If your model needs 10 chips for compute, 4 for bandwidth, and 8 for memory, you need 10 (the max). Compute is the binding constraint in that case.
- $/M-token: "Dollars per million tokens." Mintok's unifying metric. All capex / power / opex collapse into this single number for like-for-like comparison.
- Active parameters: For Mixture-of-Experts models, only a subset of params is active per token. We size against active params (e.g., DeepSeek-R1 = 671B total, 37B active).
- Prefill vs decode: Prefill = processing the input prompt (one large parallel matmul). Decode = generating output tokens one at a time (autoregressive, smaller per-step compute, memory-bandwidth bound).
- TCO horizon: The years over which total cost is summed. 3-year TCO is typical for cloud comparison; 5-year for CapEx amortization.
- Sensitivity table: Cost projection rendered as a range across uncertain inputs (typically agent-calls-per-event). The customer sees a band, not a point estimate, because we refuse to fake precision we don't have.
Spotted a discrepancy? If a number in your brief doesn't reconcile with these formulas, it's a bug — not undocumented behavior. Reply to the brief email and we'll trace it. Methodology version v0.1 is current as of 2026-05-14. Future versions will be diffed against this one. Your brief stays pinned to the version that was current when it was authored, regardless of future changes.