AI Cost Management

AI bills are surprising for two reasons: nobody told you what a token is, and nobody set a cap. This page fixes both.

Tokens, in one minute

A token is the unit AI providers charge in. Roughly:

1 token ≈ 4 characters of English ≈ ¾ of a word.
1,000 tokens ≈ 750 words ≈ a short email.
A typical contract or 10-page PDF: 5,000–15,000 tokens.
A long meeting transcript: 20,000–50,000+ tokens.

You pay for input tokens (what you send) and output tokens (what the model generates). They are usually priced differently — output is typically 3–5× more expensive than input.

For more on the concept, see the Token entry in the Executive Glossary.

The pricing math, by hand

Every prompt costs:

cost = (input_tokens / 1000 × input_price) + (output_tokens / 1000 × output_price)

A worked example. Suppose you are summarizing customer support tickets:

Input	Value
Avg ticket size	800 tokens
Avg summary size	200 tokens
Tickets per day	500
Input price (per 1K tokens)	$0.0025
Output price (per 1K tokens)	$0.01

Daily cost:

Input: 500 × 800 / 1000 × $0.0025 = $1.00
Output: 500 × 200 / 1000 × $0.01 = $1.00
Total: ~$2/day, ~$60/month.

Now imagine the same workflow with a long-context "premium" model at 10× the price. Same math: ~$600/month. Same workflow, an order of magnitude more, possibly for no measurable quality gain.

The lesson: always do the math before picking the model.

Where costs go out of control

Cause	Symptom	Fix
Stuffing the whole knowledge base into every prompt	Token spend grows linearly with corpus size.	Use RAG so only the relevant chunks get sent. See RAG Preparation.
Long, repetitive system prompts	Every request pays the system prompt cost.	Trim ruthlessly. Cache static system prompts where the provider supports it.
Looping agents	A single user request fires 50 tool calls.	Per-agent budget caps and tool-call ceilings — see Agent Architecture Standards.
Premium model on a routine task	$$$ per call.	Route by task: cheap model for classification, premium only when needed.
Context window inflation	Prompts grow over a long conversation.	Summarize or truncate older turns.
Test runs in dev	Eval suites spending production money.	Separate dev budget. Run the eval against the cheaper model when iterating prompts; verify on the production model before shipping.

Cost vs. accuracy: the real trade-off

Bigger models cost more and are more accurate, but only sometimes:

For structured extraction with clear inputs, a small fast model often matches a premium model. Test both.
For open-ended reasoning, premium models genuinely help.
For routing or classification, a cheap model + a clear prompt is almost always enough.
For safety-critical work, premium model + an eval gate beats cheap model + hope.

Run the Agent Evaluation Framework on candidate models. If a cheaper model holds the score, take the savings.

When to switch models

Trigger a model review when:

Cost per request crosses a threshold you set in advance.
A new model from your current provider drops at lower cost or higher quality.
Eval scores plateau and you need a step change.
You move from prototype to production scale (the price of "free credits" stops applying).

The discipline: never switch a production model without re-running the eval. It is the only way to know whether the new model actually preserves quality on your work.

Budget guardrails

Set these before launch, not after the first surprise invoice:

Guardrail	Where it lives
Per-agent monthly budget	Provider dashboard or your own meter.
Per-request token cap	In the API call (`max_tokens`).
Tool-call ceiling per request	In the agent runtime.
Daily spend alert (e.g. 50%, 80%, 100% of cap)	Monitoring system.
Hard kill switch above cap	Feature flag.
Anomaly alert (e.g. cost per request 3× rolling average)	Monitoring system.

Cost monitoring is a first-class signal — it lives in the same dashboard as latency, eval score, and HITL approval rate (see Agent Architecture Standards).

A back-of-the-envelope template

Before any new use case, estimate:

Item	Your value
Avg input tokens per request
Avg output tokens per request
Requests per day
Days per month
Input price per 1K
Output price per 1K
Estimated monthly cost	`((avg_in × req × days) / 1000 × in_price) + ((avg_out × req × days) / 1000 × out_price)`

If the estimate is uncomfortable, look at the corpus, the model, and the loops before the launch — not after.

Quick wins for cost reduction

Trim system prompts. Most are 2–3× larger than they need to be.
Use RAG instead of long context. Sending only relevant chunks beats sending the whole document.
Cache. Repeated queries should hit a cache, not the model.
Batch. When the workflow allows, group small requests into one larger call.
Cheap model for routing. Use a small model to decide which prompt or tool to use; only the chosen step pays for the premium model.
Truncate old turns in long conversations. Older context rarely helps current accuracy.

Need help implementing or feeling stuck? Contact us today to establish a consulting relationship.

Tokens, in one minute​

The pricing math, by hand​

Where costs go out of control​

Cost vs. accuracy: the real trade-off​

When to switch models​

Budget guardrails​

A back-of-the-envelope template​

Quick wins for cost reduction​