A lot of AI strategy starts in the wrong place.
The first question is usually some version of: which model should we use? GPT, Claude, Gemini, open weights, hosted inference, local inference, small model, large model, agent framework, vector database, orchestration platform. The market encourages that question because it is easy to sell. It turns architecture into a shopping list and makes progress feel like procurement.
But the model is not the strategy. The model is not the system. The model is not even the first architectural decision.
The model comes last.
That does not mean the model is unimportant. Model capability matters. Latency matters. Cost matters. Context length matters. Tool-use reliability matters. Safety behavior matters. But those details only become meaningful after the system knows what it is trying to do, what it is allowed to do, what it must never do, what data it can see, what actions it can take, what human approvals are required, and what failure looks like.
Selecting a model before those decisions are made is not an architecture decision. It is a guess.
Start with the operating model
The first layer of AI architecture is the operating model. Before choosing a model, you need to understand how work moves.
Who initiates the workflow? What is the user’s intent? What information does the system need to retrieve? What authority does the user have? What systems does the workflow touch? What decisions are reversible? What decisions are irreversible? What is low risk? What requires approval? What should be logged? What should be hidden from the model entirely? What does success mean in operational terms?
These questions are not slower than model selection. They are the work that makes model selection useful.
A customer-support assistant, for example, does not need one generic intelligence layer. It needs a resolution policy, a customer identity boundary, access rules, escalation logic, tone constraints, refund authority, documentation retrieval, case history summarization, and a definition of when the system is allowed to act versus recommend. Only after that does model choice become clear.
A sales research agent has different needs. It may need web context, CRM access, account history, structured qualification criteria, a distinction between research and outreach, and strict controls around what can be sent externally. The best model for drafting an email may not be the best model for classifying account fit. The best model for synthesizing research may not be the best tool for deduping records.
A product strategy assistant is different again. It may need long-context synthesis, citation discipline, strong reasoning, and very few write actions. A scheduling agent may need almost no large-model reasoning at all once intent is classified.
When the operating model is clear, the system stops asking for one perfect model. It starts assigning work to the right components.
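To make the support example concrete, here is a minimal sketch of what writing those boundaries down might look like before any model enters the picture. Every field, name, and threshold is a hypothetical illustration, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SupportOperatingModel:
    """Boundaries settled before any model is chosen. All values are illustrative."""
    identity_check_required: bool = True                          # customer identity boundary
    max_autonomous_refund: float = 50.0                           # refund authority ceiling
    may_act: tuple = ("draft_reply", "create_ticket")             # allowed to do on its own
    may_only_recommend: tuple = ("close_case", "issue_refund")    # a human decides
    escalate_on: tuple = ("legal", "chargeback", "churn threat")  # hand off immediately
    retrieval_sources: tuple = ("help_docs", "case_history")      # what it may read
    hidden_from_model: tuple = ("payment_credentials", "internal_margins")  # never in context

policy = SupportOperatingModel()
print(policy.may_only_recommend)  # ('close_case', 'issue_refund')
```

Nothing in that structure depends on which model eventually runs. That is the point: the boundaries exist first, and any model must fit inside them.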
The problem with model-first thinking
Model-first thinking creates avoidable fragility.
The first failure is over-generalization. A team chooses a powerful model and expects it to handle everything: intent classification, policy enforcement, retrieval decisions, tool selection, formatting, reasoning, and final response generation. That can work in a demo because the path is narrow. In production, it turns the model into a catch-all for every unresolved design decision.
The second failure is cost leakage. If every task is routed through a frontier model, the system pays premium inference prices for work that could often be handled by a classifier, rules engine, deterministic query, template, or small local model. The organization then concludes that AI is expensive, when the real issue is architectural laziness.
The third failure is hidden risk. When a model is responsible for deciding both what should happen and how to express it, the boundary between policy and output becomes blurry. The model may appear aligned in normal cases but become unreliable under unusual pressure, incomplete context, or conflicting instructions.
The fourth failure is vendor lock-in disguised as capability. If orchestration, data formatting, tool logic, and business rules are shaped around one provider’s assumptions, the system becomes hard to move. The team is no longer buying a model. It is inheriting an architecture.
The fifth failure is evaluation confusion. Teams compare models on output preference rather than system performance. The chosen model may write beautifully while the workflow still fails because retrieval is weak, tool schemas are bloated, permissions are unclear, or the task should not have been generative in the first place.
The common thread is simple: the model is being asked to compensate for missing architecture.
The model-routing layer is the strategic layer
A mature AI system is rarely one model. It is a routing layer.
Some requests need no model at all. Some need a tiny classifier. Some need a cheap hosted model. Some need a local model because the data should not leave the device or network. Some need a frontier model because the task requires nuanced synthesis or complex reasoning. Some need multiple passes with different components: classify, retrieve, validate, generate, check, then act.
This is where serious architecture begins to show up.
A routing layer can decide whether a request needs email, calendar, CRM, file search, documentation retrieval, or no tool. It can decide whether the request is read-only or action-oriented. It can decide whether a local deterministic process should run before the large model sees anything. It can decide whether sensitive data should be redacted, summarized, or excluded. It can decide whether the user is asking for something the system is not allowed to do.
The routing layer is not glamorous. It is also where much of the value lives.
Good routing reduces cost because expensive models are reserved for expensive problems. It improves latency because simple tasks do not wait behind complex reasoning. It improves safety because the system can apply hard constraints before generation. It improves reliability because specialized components handle the work they are best suited to handle.
This is one of the reasons I keep returning to deterministic-first architecture. The point is not to avoid intelligence. The point is to prevent intelligence from being wasted on work that should be structured.
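As an illustration of deterministic-first routing, here is a toy dispatcher. The tiers, patterns, and word-count threshold are invented for the sketch; a real system would use trained classifiers and policy checks rather than regexes:

```python
import re

def route(request: str) -> str:
    """Deterministic-first routing: cheap checks run before any model is invoked."""
    text = request.strip().lower()
    # No model needed: a canned lookup answers the request.
    if re.fullmatch(r"(hi|hello|thanks?)[!.]?", text):
        return "canned_response"
    # Structured queries go to a deterministic service, not a generator.
    if text.startswith(("status of order", "track order")):
        return "order_lookup_service"
    # Short, single-intent requests can be handled by a small, cheap model.
    if len(text.split()) < 20:
        return "small_model"
    # Only open-ended, multi-step work earns the expensive path.
    return "frontier_model"

for r in ["hi", "status of order 4521",
          "Compare our Q1 and Q2 churn drivers across all enterprise accounts "
          "and draft a narrative summary for the board, including risks, open "
          "questions, and recommended next steps"]:
    print(route(r))
```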
Context is a budget, not a landfill
Model-first systems often try to solve reliability by adding more context.
More instructions. More policy text. More examples. More tool schemas. More retrieved chunks. More memory. More formatting rules. More warnings. More explanations of what not to do.
Eventually the prompt becomes a landfill of unresolved system design.
Context is powerful, but it is not free. Every token has a cost. Every irrelevant instruction competes with relevant information. Every oversized schema increases cognitive load for the model. Every bloated prompt makes debugging harder because nobody knows which part of the context produced the behavior.
A stronger approach is to treat context as a managed budget.
The system should decide what context is required for this request, not every possible request. It should prefetch only relevant schemas. It should retrieve only useful documents. It should compress stable policies into deterministic checks where possible. It should use classifiers and structured intermediates to narrow the task before asking a large model to reason.
This is not just optimization. It is architectural discipline.
A lean context window is easier to evaluate. A smaller tool surface is easier to secure. A compressed schema is easier for the model to use correctly. A focused retrieval set is less likely to distract. The system becomes cheaper, faster, and often more accurate because it stops asking the model to sort through noise.
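Here is a minimal sketch of context-as-budget, assuming relevance scores come from retrieval and token counts from whatever tokenizer the system uses. The greedy packing is the simplest possible policy, not a recommendation:

```python
def fit_context(items: list[tuple[str, float, int]], budget_tokens: int) -> list[str]:
    """Greedy context packing: highest-relevance items first, hard token ceiling.
    Each item is (text, relevance_score, token_count)."""
    selected, used = [], 0
    for text, score, tokens in sorted(items, key=lambda x: x[1], reverse=True):
        if used + tokens <= budget_tokens:
            selected.append(text)
            used += tokens
    return selected

chunks = [
    ("Refund policy, section 3", 0.92, 400),
    ("Company history overview", 0.31, 900),
    ("Case #8841 summary", 0.85, 250),
]
print(fit_context(chunks, budget_tokens=800))
# Keeps the two relevant chunks; the 900-token digression never reaches the model.
```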
Architecture decides what intelligence means
The word intelligence is often treated as a property of the model. But in production systems, intelligence is also a property of the surrounding structure.
A mediocre model inside a strong architecture can outperform a powerful model inside a weak one for many operational tasks. It has better context, clearer constraints, narrower decisions, cleaner tools, and better feedback. It knows less in general, but it knows what matters for the task.
A powerful model inside a weak architecture may sound impressive while making poor operational decisions. It may generate confident answers from incomplete data. It may call tools unnecessarily. It may miss a boundary condition hidden in a policy document. It may follow a user’s instruction when it should ask for approval. It may optimize for fluency instead of correctness.
The architecture determines whether model capability becomes usable intelligence.
This is especially important for organizations adopting AI under real constraints. Most teams do not need a science project. They need systems that reduce work, preserve accountability, and fit into existing operations without creating silent risk. That requires a clear separation between intelligence, authority, and action.
The model may reason. The system decides whether that reasoning is allowed to become an operation.
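In code, that separation can be as simple as a gate between what the model proposes and what the system executes. The action names and tiers below are invented for the sketch:

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    """The most a model output is ever allowed to be: a proposal, not an execution."""
    name: str
    args: dict

ALLOWED = {"send_reply", "create_ticket"}           # actions the system may take itself
NEEDS_APPROVAL = {"issue_refund", "delete_record"}  # reasoning alone is not authority

def execute(action: ProposedAction) -> str:
    """The gate between intelligence and operation."""
    if action.name in NEEDS_APPROVAL:
        return f"queued for human approval: {action.name}"
    if action.name not in ALLOWED:
        return f"rejected: {action.name} is outside the operating envelope"
    return f"executed: {action.name}"

print(execute(ProposedAction("issue_refund", {"amount": 120})))
print(execute(ProposedAction("send_reply", {"text": "..."})))
print(execute(ProposedAction("export_all_customers", {})))
```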
Local components change the economics
The model-last principle becomes even more important as local and edge inference improve.
If the architecture assumes every task must go to a large hosted model, the system inherits a fixed economic shape. But if the architecture can route across local classifiers, small local models, hosted models, deterministic services, and frontier models, the economics become much more flexible.
A local classifier can decide whether a request needs a tool. A local embedding model can support private retrieval. A small model can draft structured summaries for low-risk internal workflows. A deterministic validator can check required fields. A hosted frontier model can be reserved for complex synthesis.
This is how AI systems become sustainable. Not by hoping model prices fall forever, but by designing architectures that use the right level of intelligence for the job.
It also changes the privacy calculus. Sensitive workflows can keep more data near the user, the device, or the organization. Cloud models can still be used when needed, but locality becomes a deliberate design choice rather than an impossibility.
The future will not be purely local or purely cloud. It will be routed.
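A toy placement function shows the shape of that routing. The keyword screen stands in for a real data-loss-prevention check, and the tier names are placeholders:

```python
import re

SENSITIVE = re.compile(r"\b(ssn|salary|diagnosis|password)\b", re.IGNORECASE)

def place_workload(text: str, needs_deep_reasoning: bool) -> str:
    """Locality as a design choice: sensitive data stays near the user by default."""
    if SENSITIVE.search(text):
        # Sensitive content: keep it local, or redact before any cloud call.
        return "redact_then_cloud" if needs_deep_reasoning else "local_small_model"
    return "cloud_frontier_model" if needs_deep_reasoning else "local_small_model"

print(place_workload("Summarize my diagnosis history", needs_deep_reasoning=False))
print(place_workload("Draft a market analysis", needs_deep_reasoning=True))
```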
The best model is contextual
There is no best model in the abstract.
There is a best model for a task, a risk level, a cost envelope, a latency requirement, a context shape, a language requirement, a privacy boundary, a tool-use pattern, and a failure tolerance.
That is why model selection should happen late. By the time the architecture is clear, the selection criteria are no longer vague. You can ask specific questions.
Does this task require deep reasoning or just classification? Does it need long context or precise retrieval? Does it need structured output? Does it need reliable tool use? Can it run locally? What is the acceptable cost per action? What happens if it fails? How much ambiguity should be escalated to a human? Is speed more important than nuance? Is privacy more important than model strength?
These questions make model choice concrete.
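One way to see how concrete: the questions translate into hard constraints that candidate models either satisfy or fail. The names, numbers, and capability flags below are invented for illustration, not real benchmarks:

```python
# The task's requirements, answered before any vendor comparison.
TASK = {"needs_tool_use": True, "max_cost_per_call": 0.01,
        "max_latency_ms": 800, "must_run_locally": False}

# Hypothetical candidates; real entries would come from your own evaluations.
CANDIDATES = {
    "small-local": {"tool_use": False, "cost": 0.0,   "latency_ms": 150,  "local": True},
    "mid-hosted":  {"tool_use": True,  "cost": 0.004, "latency_ms": 600,  "local": False},
    "frontier":    {"tool_use": True,  "cost": 0.06,  "latency_ms": 2500, "local": False},
}

def viable(spec: dict) -> bool:
    """A model is only 'best' relative to the task's constraints."""
    return ((spec["tool_use"] or not TASK["needs_tool_use"])
            and spec["cost"] <= TASK["max_cost_per_call"]
            and spec["latency_ms"] <= TASK["max_latency_ms"]
            and (spec["local"] or not TASK["must_run_locally"]))

print([name for name, spec in CANDIDATES.items() if viable(spec)])  # ['mid-hosted']
```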
They also make replacement easier. If the model sits behind a well-designed interface, the organization can swap providers, introduce local components, or change routing policies without rebuilding the whole system. That flexibility is strategic. It prevents today’s model decision from becoming tomorrow’s architecture prison.
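A sketch of what a well-designed interface can mean in practice: callers depend on a narrow contract, and providers become swappable adapters behind it. The stub classes stand in for real SDK calls or on-device runtimes:

```python
from typing import Protocol

class TextModel(Protocol):
    """The only contract the rest of the system is allowed to depend on."""
    def complete(self, prompt: str, max_tokens: int) -> str: ...

class StubHostedModel:
    """Stands in for a provider SDK call; swap the body, keep the signature."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[hosted completion for: {prompt[:40]}...]"

class StubLocalModel:
    """Stands in for on-device inference behind the same interface."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[local completion for: {prompt[:40]}...]"

def summarize(model: TextModel, text: str) -> str:
    # Callers never learn which provider ran, or whether inference left the device.
    return model.complete(f"Summarize:\n{text}", max_tokens=200)

print(summarize(StubHostedModel(), "Quarterly review notes..."))
print(summarize(StubLocalModel(), "Quarterly review notes..."))
```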
The operating model is the moat
For companies building AI-native products, the durable advantage is rarely just access to a model. Frontier capability diffuses. APIs become cheaper. Open models improve. Tooling gets copied. The deeper advantage is understanding the operating model better than competitors do.
How should work actually move? What should be automated? What should be augmented? What should remain human? What information should be retrieved at which moment? What are the real constraints in the domain? Where do users need confidence? Where do they need control? Where do they need speed? Where do they need taste?
Those answers shape the architecture. The architecture shapes the product. The model then serves the product.
This is the correct order.
When the model comes first, the product becomes a wrapper around capability. When the operating model comes first, the model becomes a component inside a system with a reason to exist.
That distinction will matter more as AI becomes more common. The novelty of generation will fade. Users will stop being impressed that software can produce text, images, summaries, and suggestions. They will care whether the system actually improves the work.
That improvement comes from architecture.
The model comes last because purpose comes first
The deepest reason the model comes last is that purpose comes first.
A system should know what it is for before it chooses how to think. It should know its operating envelope before it is given tools. It should know its human accountability structure before it is allowed to act. It should know its evaluation criteria before it optimizes for outputs.
This is not a philosophical luxury. It is practical engineering.
AI systems fail when the sophistication of their components outruns the clarity of their purpose. They succeed when intelligence is placed inside a structure that can preserve the relationship between objective, criterion, and action over time.
Choose the operating model. Define the boundaries. Build the routing layer. Decide what should be deterministic. Decide where human judgment belongs. Design the data flow. Instrument the system. Then choose the model.
The model comes last not because it matters least, but because it should serve everything that matters more.