AI Cost Management
AI bills are surprising for two reasons: nobody told you what a token is, and nobody set a cap. This page fixes both.
Tokens, in one minute
A token is the unit AI providers charge in. Roughly:
- 1 token ≈ 4 characters of English ≈ ¾ of a word.
- 1,000 tokens ≈ 750 words ≈ a short email.
- A typical contract or 10-page PDF: 5,000–15,000 tokens.
- A long meeting transcript: 20,000–50,000+ tokens.
You pay for input tokens (what you send) and output tokens (what the model generates). They are usually priced differently — output is typically 3–5× more expensive than input.
For more on the concept, see the Token entry in the Executive Glossary.
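The rules of thumb above can be turned into a quick estimator for sizing a document before you send it. This is a minimal sketch: the function names are illustrative, and the 4-characters and ¾-word ratios are the approximations from this page, not a real tokenizer. Actual counts vary by model and language, so use your provider's tokenizer when you need exact numbers.

```python
import math

# Rule-of-thumb ratios from this page (approximations, English text only).
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)
```

For example, `estimate_tokens_from_words(750)` gives 1,000, which matches the "short email" figure above.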
The pricing math, by hand
Every request costs the following, with prices quoted per 1,000 tokens:
cost = (input_tokens / 1000 × input_price) + (output_tokens / 1000 × output_price)
A worked example. Suppose you are summarizing customer support tickets:
| Input | Value |
|---|---|
| Avg ticket size | 800 tokens |
| Avg summary size | 200 tokens |
| Tickets per day | 500 |
| Input price (per 1K tokens) | $0.0025 |
| Output price (per 1K tokens) | $0.01 |
Daily cost:
- Input: 500 × 800 / 1000 × $0.0025 = $1.00
- Output: 500 × 200 / 1000 × $0.01 = $1.00
- Total: ~$2/day, ~$60/month.
Now imagine the same workflow with a long-context "premium" model at 10× the price. Same math: ~$600/month. Same workflow, an order of magnitude more, possibly for no measurable quality gain.
The lesson: always do the math before picking the model.
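The same arithmetic is easy to script so you can compare models side by side. The sketch below reuses the ticket-summary numbers from the worked example. The function names and the 30-day month are assumptions, and the "premium" prices are simply the example prices multiplied by 10, not any real provider's rates.

```python
def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost of one request in dollars, with prices quoted per 1,000 tokens."""
    return (input_tokens / 1000 * input_price_per_1k) + (
        output_tokens / 1000 * output_price_per_1k
    )


def monthly_cost(input_tokens, output_tokens, input_price_per_1k,
                 output_price_per_1k, requests_per_day, days_per_month=30):
    """Monthly cost, assuming every request is the average size."""
    per_request = request_cost(
        input_tokens, output_tokens, input_price_per_1k, output_price_per_1k
    )
    return per_request * requests_per_day * days_per_month


# Worked example: 500 tickets/day, 800 tokens in, 200 tokens out.
standard = monthly_cost(800, 200, 0.0025, 0.01, requests_per_day=500)
# Hypothetical premium model at 10x the price of the example model.
premium = monthly_cost(800, 200, 0.025, 0.10, requests_per_day=500)

print(f"standard: ${standard:,.2f}/month")
print(f"premium:  ${premium:,.2f}/month")
```

This reproduces the roughly $60 and $600 monthly figures above. Swap in your own token counts and prices before drawing any conclusion.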
Where costs go out of control
| Cause | Symptom | Fix |
|---|---|---|
| Stuffing the whole knowledge base into every prompt | Token spend grows linearly with corpus size. | Use RAG so only the relevant chunks get sent. See RAG Preparation. |
| Long, repetitive system prompts | Every request pays the system prompt cost. | Trim ruthlessly. Cache static system prompts where the provider supports it. |
| Looping agents | A single user request fires 50 tool calls. | Per-agent budget caps and tool-call ceilings — see Agent Architecture Standards. |
| Premium model on a routine task | $$$ per call. | Route by task: cheap model for classification, premium only when needed. |
| Context window inflation | Prompts grow over a long conversation. | Summarize or truncate older turns. |
| Test runs in dev | Eval suites spending production money. | Separate dev budget. Run the eval against the cheaper model when iterating prompts; verify on the production model before shipping. |
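As one illustration of the "context window inflation" fix, here is a minimal sketch of truncating older turns to fit an input-token budget. The function name and the character-based token estimate are assumptions. A production version would use the provider's tokenizer and might summarize dropped turns instead of discarding them.

```python
import math


def rough_tokens(text: str) -> int:
    """Rule-of-thumb estimate: about 4 characters per token (English only)."""
    return math.ceil(len(text) / 4)


def trim_history(system_prompt, turns, max_input_tokens):
    """Keep the most recent turns that fit in the budget left after the system prompt.

    `turns` is a list of strings ordered oldest to newest.
    Returns the kept turns in their original order.
    """
    budget = max_input_tokens - rough_tokens(system_prompt)
    kept = []
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = rough_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()
    return kept
```

Stopping at the first turn that does not fit keeps the retained history contiguous, so the model never sees a conversation with a hole in the middle.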
Cost vs. accuracy: the real trade-off
Bigger models always cost more, but they are only sometimes more accurate:
- For structured extraction with clear inputs, a small fast model often matches a premium model. Test both.
- For open-ended reasoning, premium models genuinely help.
- For routing or classification, a cheap model + a clear prompt is almost always enough.
- For safety-critical work, premium model + an eval gate beats cheap model + hope.
Run the Agent Evaluation Framework on candidate models. If a cheaper model holds the score, take the savings.
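The "take the savings" rule can be written down as a simple selection step. This is a sketch under assumptions: the `Candidate` fields, the tolerance value, and the idea that your eval produces a single score per model are all illustrative, not part of the Agent Evaluation Framework itself.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str                # hypothetical model label
    eval_score: float        # score from your eval suite, higher is better
    cost_per_request: float  # dollars, from the pricing math


def cheapest_acceptable(candidates, baseline_score, tolerance=0.01):
    """Return the cheapest candidate within `tolerance` of the baseline score.

    Returns None if no candidate holds the score.
    """
    acceptable = [
        c for c in candidates if c.eval_score >= baseline_score - tolerance
    ]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: c.cost_per_request)
```

The tolerance is a decision you make up front: how much eval score you are willing to give up for the savings. For safety-critical work it may well be zero.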
When to switch models
Trigger a model review when:
- Cost per request crosses a threshold you set in advance.
- Your current provider releases a new model at lower cost or higher quality.
- Eval scores plateau and you need a step change.
- You move from prototype to production scale, where free credits no longer cover the bill.
The discipline: never switch a production model without re-running the eval. It is the only way to know whether the new model actually preserves quality on your work.
Budget guardrails
Set these before launch, not after the first surprise invoice:
| Guardrail | Where it lives |
|---|---|
| Per-agent monthly budget | Provider dashboard or your own meter. |
| Per-request token cap | In the API call (max_tokens). |
| Tool-call ceiling per request | In the agent runtime. |
| Daily spend alert (e.g. 50%, 80%, 100% of cap) | Monitoring system. |
| Hard kill switch above cap | Feature flag. |
| Anomaly alert (e.g. cost per request 3× rolling average) | Monitoring system. |
Cost monitoring is a first-class signal — it lives in the same dashboard as latency, eval score, and HITL approval rate (see Agent Architecture Standards).
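If you run your own meter, several of these guardrails fit in a small in-process class. The sketch below is illustrative only: the class name, the method names, and the rolling-window size are assumptions, and a real deployment would persist spend and route alerts to your monitoring system instead of returning strings.

```python
from collections import deque


class BudgetGuard:
    """Tracks spend against a cap, raises threshold alerts, and flags cost anomalies."""

    def __init__(self, cap, alert_levels=(0.5, 0.8, 1.0), anomaly_factor=3.0, window=100):
        self.cap = cap
        self.alert_levels = sorted(alert_levels)
        self.anomaly_factor = anomaly_factor
        self.recent = deque(maxlen=window)  # rolling window of per-request costs
        self.spent = 0.0
        self.fired = set()  # threshold alerts already raised

    def allow_request(self):
        """Hard kill switch: refuse new requests once spend reaches the cap."""
        return self.spent < self.cap

    def record(self, cost):
        """Record one request's cost and return any alerts it triggered."""
        alerts = []
        # Anomaly check: compare against the rolling average of earlier requests.
        if self.recent:
            average = sum(self.recent) / len(self.recent)
            if average > 0 and cost > self.anomaly_factor * average:
                alerts.append(f"anomaly: request cost {cost:.4f} vs average {average:.4f}")
        self.recent.append(cost)
        self.spent += cost
        # Threshold alerts fire once each as spend crosses a level.
        for level in self.alert_levels:
            if level not in self.fired and self.spent >= level * self.cap:
                self.fired.add(level)
                alerts.append(f"spend reached {level:.0%} of cap")
        return alerts
```

Call `allow_request()` before each model call and `record()` after it, using the cost computed from the token counts the provider reports for that request.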
A back-of-the-envelope template
Before any new use case, estimate:
| Item | Your value |
|---|---|
| Avg input tokens per request | |
| Avg output tokens per request | |
| Requests per day | |
| Days per month | |
| Input price per 1K | |
| Output price per 1K | |
| Estimated monthly cost | ((avg_in × req × days) / 1000 × in_price) + ((avg_out × req × days) / 1000 × out_price) |
If the estimate is uncomfortable, look at the corpus, the model, and the loops before the launch — not after.
Quick wins for cost reduction
- Trim system prompts. Most are 2–3× larger than they need to be.
- Use RAG instead of long context. Sending only relevant chunks beats sending the whole document.
- Cache. Repeated queries should hit a cache, not the model.
- Batch. When the workflow allows, group small requests into one larger call.
- Cheap model for routing. Use a small model to decide which prompt or tool to use; only the chosen step pays for the premium model.
- Truncate old turns in long conversations. Older context rarely helps current accuracy.
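As a sketch of the caching quick win, the wrapper below answers repeated prompts from memory instead of calling the model again. Everything here is an assumption for illustration: `call_model` stands in for whatever function sends a prompt to your provider, and exact-match caching only makes sense for deterministic, non-personalized queries.

```python
import hashlib


class CachedModel:
    """Wraps a model-calling function with an exact-match, in-memory cache."""

    def __init__(self, call_model):
        self._call_model = call_model  # hypothetical: function(prompt: str) -> str
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def ask(self, prompt):
        # Collapse whitespace so trivially different copies share one cache key.
        normalized = " ".join(prompt.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        answer = self._call_model(normalized)
        self._cache[key] = answer
        return answer
```

The `hits` and `misses` counters give you a hit rate to watch. If it stays near zero, the cache is adding complexity without saving money.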