Running tracesage in production¶

tracesage is a single-process embedded library — local-first, not a hosted production-monitoring service. This guide covers the knobs and safety rails for running it beyond your laptop (auth, sampling, retention) plus exporting to a real production observability stack via OpenTelemetry. It's architected so future versions can split the producer (your app) from the consumer (a centralized trace server).

Defaults that matter¶

Binds 127.0.0.1 only. Set auth_token if you want to expose the UI.
sample_rate=1.0 — capture every event. Lower in high-volume production.
per_run_event_cap=50000 — single runaway run won't fill the queue.
Queue drops events instead of blocking the user's pipeline. Counter exposed in /api/stats.

Turning tracesage off (kill switch)¶

Wire tracesage in once and disable it per-environment — no code change:

export TRACESAGE_ENABLED=false      # prod: complete no-op

When disabled, TraceSage.create() / tracesage.trace() / TraceSage.session() return an inert tracer: no embedded server is started, no DB/worker/queue is created, .handler is a no-op callback, and overhead is near zero. Your callbacks=[tracer.handler] (or global install()) wiring keeps working untouched — it just does nothing. This is the recommended way to ship the integration in code but keep it out of an environment that doesn't want it.

Overhead when disabled (measured against a near-instant fake LLM, so it's a worst-case microbenchmark — real LLM latency dwarfs all of these):

Startup: TraceSage.create() returns in ~0.3 ms and binds nothing — no port, no DB file, no worker task, no background thread.
Per call, global-install pattern (tracesage.trace() / install=True): effectively zero — install() is a no-op, so nothing is registered with LangChain and your calls take the same path as if tracesage weren't there.
Per call, explicit callbacks=[tracer.handler] pattern: a small fixed cost (~14 µs/call in the microbenchmark) from LangChain building a callback manager and doing one skip-check per event. It does not scale with your workload; against a real LLM call (hundreds of ms) it is < 0.01%. The disabled handler sets every LangChain ignore_* flag, so its methods are never actually invoked.

For literal zero overhead with the same code in every environment, prefer the global-install pattern and toggle TRACESAGE_ENABLED.

Recommended production config¶

If you DO want tracing in prod but not the dev conveniences:

from tracesage import TraceSage, TraceSageConfig

cfg = TraceSageConfig(
    enabled=True,
    start_server=False,     # don't run the embedded UI server in your app process
    print_run_url=False,    # no stderr trace-link spam
    sample_rate=0.1,        # capture a slice (see Sampling guidance below)
)
tracer = await TraceSage.create(config=cfg)   # config drives start_server

Every field above has a TRACESAGE_* env var (TRACESAGE_START_SERVER=false, TRACESAGE_PRINT_RUN_URL=false, TRACESAGE_SAMPLE_RATE=0.1, …), so you can ship the same code and tune it per environment without editing it. A start_server= kwarg to create()/session() overrides the config when you need to force it.

Long-running async service (e.g. FastAPI). Create the tracer once at startup and reuse its handler across requests; stop it on shutdown so the queue drains:

from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app):
    app.state.tracer = await TraceSage.create(config=cfg)
    try:
        yield
    finally:
        await app.state.tracer.stop()   # drains the queue, closes the DB

Then pass config={"callbacks": [app.state.tracer.handler]} on each ainvoke, or call app.state.tracer.install() once after create to capture globally. (The async with TraceSage.session(...) context manager is for scripts/one-shot runs — it tears down at block exit.)

Then view the data out-of-band with tracesage serve --data-dir <dir> (or point it at durable storage). Note the embedded SQLite + local blobs are best for single-process / single-node today; centralized multi-process collection is on the roadmap (see production_roadmap.md).

Sampling guidance¶

Volume	Recommended sample_rate
< 100 runs/day	1.0 (capture everything)
100–10,000 runs/day	0.5 to 1.0
10,000–100,000 runs/day	0.1 to 0.5
> 100,000 runs/day	0.01 to 0.1, plus export to a central backend via OTLP (below)

Sampling is per-root-run, so partial traces never happen.

Exporting to a central observability stack (OpenTelemetry)¶

For multi-process or production deployments, set otlp_endpoint to also stream every trace as OpenTelemetry spans to your collector / Tempo / Jaeger / Datadog / Honeycomb — the local SQLite store remains for the dev view, while the OTLP backend handles centralization, retention, and cross-service correlation:

pip install "tracesage[otel]"
export TRACESAGE_OTLP_ENDPOINT=http://otel-collector:4318

Export is best-effort and never blocks or breaks the app if the collector is down. See Configuration → OpenTelemetry export for the full mapping.

Authentication¶

Set auth_token (any random string ≥ 32 chars) and the server enforces:

HTTP: Authorization: Bearer <token> on every endpoint except /api/health.
WebSocket: ?token=<value> query param OR Sec-WebSocket-Protocol subprotocol.

Constant-time comparison via hmac.compare_digest prevents timing leaks.

export TRACESAGE_HOST=0.0.0.0
export TRACESAGE_AUTH_TOKEN=$(openssl rand -hex 32)

tracesage will refuse to start if host is non-loopback and auth_token is unset.

TLS¶

tracesage does not terminate TLS. Run behind a reverse proxy (nginx, Caddy, Traefik, AWS ALB). Example nginx:

location /tracesage/ {
    proxy_pass http://127.0.0.1:7842/;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}

The UI uses relative URLs so prefix paths work.

Retention¶

The runs and events tables grow without bound by default. To enforce a retention cap:

# Daily cron / systemd timer
tracesage gc --max-runs 50000

This deletes the oldest runs and their blobs, in FIFO order.

Backup¶

The data directory is fully portable:

rsync -a --delete ~/.tracesage/ backup-host:~/tracesage_$(date +%F)/

Include traces.db, traces.db-wal, traces.db-shm, and the blobs/ directory.

What to monitor (the watcher's watcher)¶

GET /api/stats returns:

events_processed — total events written to DB. Should grow steadily.
events_dropped — should be 0 in healthy state. If non-zero, your queue is full — increase queue_maxsize or reduce sample_rate.
events_sampled_out — counter of events skipped by sampling. Informational.
runs_throttled — counter of runs that hit per_run_event_cap. Investigate runaway agents.
last_write_latency_ms — most recent batch write latency.
p99_write_latency_ms — rolling p99 over last 1000 batches. Should be < 500 ms typical.
queue_depth — current queue size. Should hover near 0.

Performance characteristics¶

Bench (Linux x86, NVMe SSD, ~5K events with 20% blob-eligible): - Sustained: 800–1200 ev/s - p99 write latency: 80–150 ms - 0 dropped under sustained load

Bench (Windows NTFS, same machine): - Sustained: 60–100 ev/s (gzip + fsync amplification) - p99 write latency: 1000–2000 ms - 0 dropped under sustained load

If you need more throughput on Windows, raise worker_batch_size to 200 and worker_batch_timeout to 0.5.

Troubleshooting¶

"Refusing to start: 0.0.0.0 without TRACESAGE_AUTH_TOKEN" — set the env var.
"Queue drain timed out" at shutdown — your worker fell behind; some events lost. Raise queue_maxsize or shutdown with more time.
"FOREIGN KEY constraint failed" in worker logs — usually means your LangGraph emitted events for a run_id that wasn't preceded by chain_start. Worker auto-creates the runs row but logs the issue.
Events visible but no agent_name — your custom Runnable doesn't supply a name. Pass with_config(run_name="MyAgent") on the Runnable.

Upgrade path¶

A future release will add a remote-server mode (producer → HTTP → trace server) that preserves your existing API. The data directory format will be compatible — old data accessible via tracesage serve against the existing dir.