
AI Tool Calling

How the Advisor Chain architecture cuts token usage by 90% while keeping LLM responses accurate and personalized.

llm · rag · tool-calling

The naive approach to personalized LLM responses is to dump everything into the prompt: user history, preferences, device state, external data. It works, but it's expensive. A single query can easily consume 40,000 tokens, most of which the model doesn't need.

The Advisor Chain is an architecture I explored that addresses this directly.

The problem with full-context injection

When you inject all available context upfront, you pay for tokens you don't use. More importantly, you introduce noise — the model has to reason over a large amount of irrelevant data to find what matters for this specific query. This hurts answer quality and drives up cost at the same time.

The Advisor Chain

Instead of injecting context statically, the Advisor Chain retrieves only what's needed, dynamically, at query time.

The key architectural decision is to run tool calls in a separate chain, not embedded inside the model's main prompt. This keeps the primary reasoning context clean and small, while the advisor layer handles retrieval in parallel.
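A minimal sketch of that split might look like the following. All names here (`run_advisors`, `build_prompt`, the example advisors) are illustrative, not an actual API from the post:

```python
# Sketch of the Advisor Chain split: advisors retrieve context in a
# separate step, and only their compact output reaches the main prompt.

def run_advisors(query: str, advisors: dict) -> dict:
    """Run each advisor against the query; collect compact snippets."""
    return {name: fn(query) for name, fn in advisors.items()}

def build_prompt(query: str, advisor_output: dict) -> str:
    """The main reasoning prompt sees only the advisors' condensed results."""
    context = "\n".join(f"[{k}] {v}" for k, v in advisor_output.items())
    return f"Context:\n{context}\n\nUser: {query}"

# Toy advisors standing in for real retrieval/device lookups.
advisors = {
    "preferences": lambda q: "prefers metric units",
    "device": lambda q: "thermostat set to 21C",
}
prompt = build_prompt("make it warmer", run_advisors("make it warmer", advisors))
```

The main prompt stays small because it never sees the advisors' raw inputs, only their condensed conclusions.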

The result: ~3,000 tokens per query instead of ~40,000. A 90% reduction.

What makes it work

Explicit tool selection guidelines. Rather than letting the model decide which tools to call through agentic reasoning, tool selection is governed by explicit rules. This makes behavior predictable and avoids the overhead of agent-style planning loops.
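One way to sketch rule-governed selection is a plain lookup table, where keyword or intent matches decide which tools run. The rule table and tool names below are hypothetical:

```python
# Explicit tool-selection rules: a deterministic table replaces
# agent-style planning. Keywords and tool names are illustrative.
TOOL_RULES = [
    (("weather", "forecast"), "weather_api"),
    (("turn on", "turn off", "set"), "device_control"),
    (("remember", "last time"), "history_search"),
]

def select_tools(query: str) -> list[str]:
    """Return the tools whose rules match the query, in rule order."""
    q = query.lower()
    return [tool for keywords, tool in TOOL_RULES
            if any(k in q for k in keywords)]
```

Because the table is inspectable, you can test tool routing like any other pure function — no model call needed.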

Strict prompting. Model configurations enforce reliable tool invocation. The model is constrained to use tools in expected ways — not because it can't reason freely, but because reliability matters more than flexibility in production.
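In practice this often means pinning the request configuration rather than hoping the model cooperates. The payload shape below mimics common chat-completion APIs but is an assumption, not the post's actual configuration:

```python
# Sketch of a "strict" request: force a specific tool call and zero
# temperature for reliable invocation. Payload shape is illustrative.
def make_request(query: str, tool_name: str) -> dict:
    return {
        "messages": [{"role": "user", "content": query}],
        # Constrain the model to call exactly this tool.
        "tool_choice": {"type": "function", "function": {"name": tool_name}},
        "temperature": 0,  # deterministic, repeatable invocation
    }
```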

Asynchronous execution. Where multiple tools are needed, they run in parallel. This keeps latency manageable even as the number of available tools grows.
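The parallel fan-out can be sketched with `asyncio.gather`; the `call_tool` stub stands in for real network calls:

```python
import asyncio

async def call_tool(name: str, query: str) -> tuple[str, str]:
    """Stand-in for a real tool call; the sleep mimics network latency."""
    await asyncio.sleep(0.01)
    return name, f"result for {query!r}"

async def run_parallel(tools: list[str], query: str) -> dict:
    """Fire all tool calls concurrently; total latency ~ the slowest call."""
    pairs = await asyncio.gather(*(call_tool(t, query) for t in tools))
    return dict(pairs)

results = asyncio.run(run_parallel(["weather_api", "device_control"], "warm up"))
```

With `gather`, adding a fourth or fifth tool costs almost no extra wall-clock time, which is the point: latency tracks the slowest advisor, not the sum.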

Prompt caching. Repeated structural elements of the prompt are cached, reducing both latency and cost on subsequent queries.
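Provider-side prompt caches typically key on a byte-identical prefix, so the structural trick is to keep everything static at the front and everything query-specific at the end. A minimal sketch, with placeholder content:

```python
# Keep the static, cacheable part of the prompt as a stable prefix so
# prompt caches can reuse it across queries. Content is illustrative.
STATIC_PREFIX = (
    "System: You are a personalization assistant.\n"
    "Tool guidelines: (rules stay byte-identical across calls)\n"
)

def assemble(query: str, context: str) -> str:
    # Only the suffix varies per query; the prefix stays cache-friendly.
    return STATIC_PREFIX + f"Context: {context}\nUser: {query}"
```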

What it handles

The architecture covers the main categories of real-world personalization needs:

  • Intent classification
  • Hybrid data retrieval (structured + unstructured)
  • External API calls
  • Device state and control
  • Feedback collection

The tradeoff

You give up some flexibility. An agent-based approach can adapt to novel situations; the Advisor Chain can't. But for most production use cases, the reliability and cost gains are worth it. Adaptive reasoning is rarely what's actually needed — accurate, fast, cheap retrieval usually is.