One of the most common questions we get: can I run Axon with a local model?

Yes. And the architecture that makes it possible is worth explaining, because it's not obvious.

The split

Axon has two layers that never collapse into each other:

Cognos — the cognition runtime. It runs the loop. It manages state, compression, stop conditions, tool dispatch. It calls $engine() when it needs tokens. It doesn't care where those tokens come from.

Axon — the user surface. It defines tools, identity, policy. It also runs in the privileged environment: your machine, your network, your local inference server.

This split is intentional and non-negotiable. Cognos runs in the cloud, unprivileged. It never touches your machine directly. Axon runs where it needs privileged access — locally in dev, on Cloud Run in production.

The engine bridge

When you configure a local inference provider:

// axon.config.ts
export default defineAgent({
    engine: {
        provider: "ollama",
        model: "llama3.2",
        baseUrl: "http://localhost:11434",
    },
})

Here's what happens:

  1. Your Axon worker registers an engine.complete RPC handler — an adapter for Ollama or any OpenAI-compatible server
  2. The worker connects to the Cognos runtime with a capability flag: { localEngine: true }
  3. When Cognos needs tokens, it calls $engine("local") in its cognitive loop
  4. That call routes via WebSocket RPC back to your Axon worker
  5. Your worker calls the local inference server, streams tokens back to Cognos
  6. Cognos continues the loop

Your machine is the privileged execution environment for both tool calls and inference. Cognos receives tokens — it never touches your local network.

Why this matters

The alternative architectures are worse:

  • Ship Cognos locally: breaks the proprietary boundary, can't update the runtime without pushing new versions
  • Bypass Cognos for local models: you lose the loop — no tools, no state, no stop conditions, no memory
  • Proxy everything through the cloud: latency, privacy concerns, unnecessary roundtrips

The engine bridge gives you local inference with full cognitive capabilities, zero infrastructure exposure, and hot-upgradeable runtime.

The result

axon dev  # starts local agent with Ollama

Cognos is running in the cloud. Your inference is running on your machine. Tools execute locally. The loop runs everywhere it needs to.

That's the bet: unbundle inference from cognition, let each live where it belongs.