Operations - Spendra

This page defines the operational checks for managed and on-prem Spendra deployments. In enterprise on-prem deployments, the customer operates Kubernetes, Postgres, ingress/TLS, backups, secrets, observability, and SIEM forwarding.

Release validation

Before promoting a deployment:

pnpm install
pnpm typecheck
pnpm lint
pnpm test
pnpm build
pnpm onprem:chart:lint
pnpm onprem:render

For on-prem dashboard releases, build the web image per environment with the target NEXT_PUBLIC_SUPABASE_URL, NEXT_PUBLIC_SUPABASE_ANON_KEY, NEXT_PUBLIC_API_BASE_URL, and optional browser Sentry DSN. Do not promote the same web image from integration to production, because those browser-facing values are compiled into the Next.js bundle.

For database-backed environments, also run:

pnpm db:migrate
pnpm worker:migrate
pnpm db:smoke

Run pnpm db:seed only for local or isolated demo environments. It creates demo Acme data and is not part of enterprise production bootstrap.
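As a rough illustration of the release rules in this section, the sketch below builds the ordered command plan, checks a web build environment for the browser-facing variables, and guards db:seed. The helper names and the EnvKind values are assumptions made for this example and are not part of Spendra; only the pnpm scripts and NEXT_PUBLIC_* names come from this page.

```typescript
// Illustrative sketch only. Helper names and EnvKind values are assumptions;
// the pnpm scripts and NEXT_PUBLIC_* names are the ones listed on this page.
type EnvKind = "local" | "demo" | "integration" | "production";

const REQUIRED_WEB_BUILD_VARS: string[] = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "NEXT_PUBLIC_API_BASE_URL",
];

// Returns the required browser-facing variables that are unset or empty.
function missingWebBuildVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_WEB_BUILD_VARS.filter((name) => !env[name]);
}

// A web image is reusable only in the environment it was built for,
// because the NEXT_PUBLIC_* values are compiled into the bundle.
function canReuseWebImage(builtFor: EnvKind, target: EnvKind): boolean {
  return builtFor === target;
}

// Ordered validation commands; database steps are appended when requested.
function validationPlan(databaseBacked: boolean): string[] {
  const plan: string[] = [
    "pnpm install",
    "pnpm typecheck",
    "pnpm lint",
    "pnpm test",
    "pnpm build",
    "pnpm onprem:chart:lint",
    "pnpm onprem:render",
  ];
  if (databaseBacked) {
    plan.push("pnpm db:migrate", "pnpm worker:migrate", "pnpm db:smoke");
  }
  return plan;
}

// db:seed creates demo data, so allow it only for local or isolated demo use.
function canRunSeed(kind: EnvKind): boolean {
  return kind === "local" || kind === "demo";
}
```

A CI job could refuse to build when missingWebBuildVars returns a non-empty list, and refuse to deploy when canReuseWebImage is false for the target environment.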

Migration checklist

Before applying migrations:

Confirm the target application version and image tag.
Confirm the latest successful Postgres backup and PITR window.
Capture a pre-upgrade restore point or snapshot when the Postgres platform supports it.
Review SQL migrations for destructive operations and long-running locks.
Run migrations once through the Helm hook job or a controlled CI job, not from multiple terminals.
Run pnpm db:smoke and application health checks after migration.
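The pre-migration items above can be treated as a gate that returns a list of blockers. This is a minimal sketch under stated assumptions: the MigrationPreflight shape, its field names, and the maxBackupAgeMinutes limit are hypothetical and would be filled in by the customer's own tooling.

```typescript
// Illustrative sketch only: every field name here is an assumption.
interface MigrationPreflight {
  imageTagConfirmed: boolean;          // target application version and image tag
  lastBackupAgeMinutes: number | null; // null when no successful backup is known
  maxBackupAgeMinutes: number;         // customer-defined limit (hypothetical)
  pitrWindowConfirmed: boolean;
  snapshotSupported: boolean;          // platform can take a restore point
  snapshotTaken: boolean;
  sqlReviewed: boolean;                // destructive operations and long locks
  runner: "helm-hook" | "ci-job" | "manual-terminal";
}

// Returns the reasons a migration should not start yet; empty means proceed.
function preflightBlockers(p: MigrationPreflight): string[] {
  const blockers: string[] = [];
  if (!p.imageTagConfirmed) {
    blockers.push("confirm target version and image tag");
  }
  if (p.lastBackupAgeMinutes === null) {
    blockers.push("no successful Postgres backup found");
  } else if (p.lastBackupAgeMinutes > p.maxBackupAgeMinutes) {
    blockers.push("latest Postgres backup is too old");
  }
  if (!p.pitrWindowConfirmed) {
    blockers.push("confirm PITR window");
  }
  // A restore point is only expected when the platform supports one.
  if (p.snapshotSupported && !p.snapshotTaken) {
    blockers.push("capture pre-upgrade restore point or snapshot");
  }
  if (!p.sqlReviewed) {
    blockers.push("review SQL for destructive operations and long-running locks");
  }
  if (p.runner === "manual-terminal") {
    blockers.push("run migrations through the Helm hook job or a controlled CI job");
  }
  return blockers;
}
```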

Rollback policy:

Prefer forward migrations for normal fixes.
Use helm rollback only to roll back the application revision, and only when the schema remains compatible with the older revision.
Use PITR/snapshot restore for destructive schema failures.
Do not manually edit schema_migrations except under an incident procedure.
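One way to read the rollback policy is as a small decision function. The sketch below is an interpretation, not Spendra code: the Failure variants and Action names are hypothetical, and mapping an application regression with an incompatible schema to a forward migration is an assumption drawn from the "prefer forward migrations" rule.

```typescript
// Illustrative sketch only: the type and variant names are assumptions.
type Failure =
  | { kind: "normal-fix" }
  | { kind: "app-regression"; schemaCompatible: boolean }
  | { kind: "destructive-schema-failure" };

type Action = "forward-migration" | "helm-rollback" | "pitr-or-snapshot-restore";

function rollbackAction(failure: Failure): Action {
  switch (failure.kind) {
    case "destructive-schema-failure":
      // Data or schema was destroyed; only a restore can bring it back.
      return "pitr-or-snapshot-restore";
    case "app-regression":
      // helm rollback is safe only while the older revision still matches the schema.
      return failure.schemaCompatible ? "helm-rollback" : "forward-migration";
    case "normal-fix":
      return "forward-migration";
  }
}
```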

Backups and recovery

Postgres is the source of truth. Back up:

Application tables.
schema_migrations and Graphile Worker schema.
Ledger entries, spend events, reservations, policies, budgets, and key metadata.
Audit records, outbox state, rollups, catalog rows, and sync-run history.

Runtime secrets are not stored in application tables. Recover secrets through the customer secrets manager and rotation process.

Define customer-owned RPO/RTO targets per environment. A typical enterprise starting point is:

Environment	RPO	RTO	Notes
Pilot	24 hours	1 business day	Snapshot backup is usually enough.
Production	15-60 minutes	1-4 hours	Requires WAL archiving/PITR and rehearsed restore.
Critical gateway	5-15 minutes	less than 1 hour	Requires HA Postgres, tested failover, and automation.

Restore drill:

Restore Postgres into an isolated environment.
Install the matching Spendra chart revision and image tag.
Run pnpm db:smoke.
Verify dashboard login, ledger reads, rollups, audit log reads, and a non-production governed gateway request.
Document actual RPO/RTO and any manual steps.
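To document actual RPO/RTO from a drill, the measured values can be compared against the starting-point targets in the table above. The sketch below uses the upper bound of each range. Treating "1 business day" as 8 hours and "less than 1 hour" as at most 60 minutes are simplifying assumptions, and the DrillTimes field names are hypothetical.

```typescript
// Illustrative sketch only: names and the 8-hour business day are assumptions.
type Env = "pilot" | "production" | "criticalGateway";

interface Target { rpoMinutes: number; rtoMinutes: number }

// Upper bounds of the ranges in the RPO/RTO table.
const TARGETS: Record<Env, Target> = {
  pilot: { rpoMinutes: 24 * 60, rtoMinutes: 8 * 60 },
  production: { rpoMinutes: 60, rtoMinutes: 4 * 60 },
  criticalGateway: { rpoMinutes: 15, rtoMinutes: 60 },
};

interface DrillTimes {
  lastRecoverablePoint: string; // ISO 8601 time of the newest restorable data
  failureAt: string;            // ISO 8601 time of the (simulated) failure
  serviceVerifiedAt: string;    // ISO 8601 time smoke checks passed after restore
}

function minutesBetween(startIso: string, endIso: string): number {
  return (Date.parse(endIso) - Date.parse(startIso)) / 60000;
}

function evaluateDrill(env: Env, drill: DrillTimes) {
  // Measured RPO is the window of data lost; measured RTO is time to verified service.
  const rpoMinutes = minutesBetween(drill.lastRecoverablePoint, drill.failureAt);
  const rtoMinutes = minutesBetween(drill.failureAt, drill.serviceVerifiedAt);
  const target = TARGETS[env];
  return {
    rpoMinutes,
    rtoMinutes,
    meetsRpo: rpoMinutes <= target.rpoMinutes,
    meetsRto: rtoMinutes <= target.rtoMinutes,
  };
}
```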

Monitoring

Track these signals:

Gateway request rate and latency.
Provider request latency and error rate.
Reservation success, failure, and hard-cap block counts.
Settlement failures.
Missing or estimated usage count.
Outbox lag and worker queue depth.
Graphile queue lag.
Rollup lag.
Database query latency and connection pool utilization.
API and worker exceptions.
Catalog sync status and last successful sync/import time.

The API exposes /metrics for private scraping, protected by the token in SPENDRA_METRICS_BEARER_TOKEN. Public ingress must deny /metrics.
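The /metrics exposure rule can be verified with a few HTTP probes. The sketch below only evaluates probe results and builds scrape headers; it makes no network calls. It assumes the token is presented as a standard Authorization: Bearer header and that an unauthenticated request is rejected, neither of which this page states explicitly. The ProbeResult field names are hypothetical.

```typescript
// Illustrative sketch only: the header format and probe names are assumptions.
interface ProbeResult {
  publicStatus: number;           // GET /metrics through the public ingress
  privateNoTokenStatus: number;   // GET /metrics on the private path, no token
  privateWithTokenStatus: number; // GET /metrics on the private path, with token
}

// Headers a private scraper would send, assuming a standard bearer scheme.
function scrapeHeaders(token: string): Record<string, string> {
  return { Authorization: "Bearer " + token };
}

function isSuccess(status: number): boolean {
  return status >= 200 && status < 300;
}

// Returns the exposure problems found; an empty list means the probes look right.
function metricsExposureProblems(probe: ProbeResult): string[] {
  const problems: string[] = [];
  if (isSuccess(probe.publicStatus)) {
    problems.push("public ingress serves /metrics");
  }
  if (isSuccess(probe.privateNoTokenStatus)) {
    problems.push("/metrics is readable without a token");
  }
  if (!isSuccess(probe.privateWithTokenStatus)) {
    problems.push("private scrape with token failed");
  }
  return problems;
}
```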

SIEM and audit export

Spendra services emit JSON logs through pino. Forward container stdout/stderr through the customer log pipeline to the SIEM. Preserve these fields when present:

request ID, organization ID, actor ID, API key ID.
provider, model or tool, route, status, error class.
policy result, reservation ID, spend event ID, ledger entry ID.
worker job type, outbox ID, catalog sync status.
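A log pipeline can be spot-checked by comparing a record as emitted with the same record as it arrives in the SIEM. In the sketch below, the key names in PRESERVE_KEYS are hypothetical, since this page lists the fields by description rather than by their pino key names. Substitute the actual keys from real log output.

```typescript
// Illustrative sketch only: these key names are guesses, not Spendra's log schema.
type LogRecord = Record<string, unknown>;

const PRESERVE_KEYS: string[] = [
  "requestId", "organizationId", "actorId", "apiKeyId",
  "provider", "model", "tool", "route", "status", "errorClass",
  "policyResult", "reservationId", "spendEventId", "ledgerEntryId",
  "jobType", "outboxId", "catalogSyncStatus",
];

// Returns keys that were present when emitted but are missing after forwarding.
// Keys absent from the emitted record are ignored ("preserve when present").
function droppedFields(emitted: LogRecord, forwarded: LogRecord, keys: string[]): string[] {
  return keys.filter((key) => key in emitted && !(key in forwarded));
}

// Example: one JSON log line before and after a (hypothetical) pipeline stage.
const emittedLine = '{"level":30,"requestId":"r-1","organizationId":"o-1","route":"/v1/x"}';
const forwardedLine = '{"level":30,"requestId":"r-1","route":"/v1/x"}';
const dropped = droppedFields(
  JSON.parse(emittedLine) as LogRecord,
  JSON.parse(forwardedLine) as LogRecord,
  PRESERVE_KEYS,
);
```

Here dropped contains only organizationId, which is the field the example pipeline stage removed.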

Finance and governance audit records live in the audit_log table. Export audit data through controlled database exports or future product export endpoints according to customer retention policy.

Troubleshooting

Gateway call is blocked

Check key status, expiration, actor binding, provider/model/tool scopes, hard-cap policies, and remaining budget. A blocked request should not produce provider spend or a ledger entry.
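The checks above can be walked in the order listed to find the first reason a call was blocked. This is a hedged sketch for triage scripts: the GatewayCheckInput shape is hypothetical, and the gateway's real evaluation order and data model may differ.

```typescript
// Illustrative sketch only: field names and check order are assumptions.
interface GatewayCheckInput {
  keyActive: boolean;
  keyExpiresAtMs: number | null; // null means the key does not expire
  nowMs: number;
  actorMatchesBinding: boolean;  // request actor matches the key's bound actor
  scopeAllowed: boolean;         // provider/model/tool is within the key's scopes
  hardCapExceeded: boolean;
  remainingBudget: number;
  estimatedCost: number;
}

// Returns the first blocking reason, or null when none of these checks blocks.
function firstBlockReason(input: GatewayCheckInput): string | null {
  if (!input.keyActive) return "key is not active";
  if (input.keyExpiresAtMs !== null && input.keyExpiresAtMs <= input.nowMs) {
    return "key is expired";
  }
  if (!input.actorMatchesBinding) return "actor binding mismatch";
  if (!input.scopeAllowed) return "provider/model/tool is out of scope";
  if (input.hardCapExceeded) return "hard-cap policy blocks the request";
  if (input.remainingBudget < input.estimatedCost) return "insufficient remaining budget";
  return null;
}
```

When a reason is returned, the expectation from this page is that no provider spend and no ledger entry exist for that request.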

Ledger is missing a request

Confirm whether the request reached the provider, whether usage was available, whether settlement completed, and whether the worker processed downstream rollups. Check gateway request metadata before assuming finance data is missing.
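The questions above form a pipeline, so the first stage that did not complete indicates where to look. The sketch below assumes a hypothetical RequestTrace assembled by hand from gateway request metadata. The field and stage names are illustrative only.

```typescript
// Illustrative sketch only: the trace shape and stage names are assumptions.
interface RequestTrace {
  reachedProvider: boolean;
  usageAvailable: boolean;
  settlementCompleted: boolean;
  rollupsProcessed: boolean;
}

type GapStage =
  | "never-reached-provider"
  | "usage-missing"
  | "settlement-incomplete"
  | "rollup-pending"
  | "no-gap-found";

// Returns the earliest stage at which the request stopped progressing.
function ledgerGapStage(trace: RequestTrace): GapStage {
  if (!trace.reachedProvider) return "never-reached-provider";
  if (!trace.usageAvailable) return "usage-missing";
  if (!trace.settlementCompleted) return "settlement-incomplete";
  if (!trace.rollupsProcessed) return "rollup-pending";
  return "no-gap-found";
}
```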

Dashboard rollups are stale

Check worker health, Graphile queue depth, outbox lag, database connectivity, and recurring rollup job scheduling.

Auth redirects fail

Check the dashboard public URL, the auth provider's redirect URLs, the browser-facing auth configuration compiled into the web image, and the API session verification configuration.

Catalog is stale in an air-gapped environment

Confirm catalog.onlineSync.enabled=false, import the latest offline bundle with pnpm catalog:import --input <file> --database-url <url>, and verify that model_catalog_sync_runs and tool_catalog_sync_runs contain runs with source offline_bundle.
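After an import, rows read from the sync-run tables can be checked for a recent offline_bundle run. The sketch below works on rows already fetched by some other means. The SyncRun shape and its finishedAt column are assumptions, since this page names the tables and the source value but not their columns.

```typescript
// Illustrative sketch only: column names other than "source" are assumptions.
interface SyncRun {
  source: string;     // expected to be "offline_bundle" for air-gapped imports
  finishedAt: string; // ISO 8601 timestamp (hypothetical column)
}

// Returns the most recent run whose source is offline_bundle, if any.
function latestOfflineRun(runs: SyncRun[]): SyncRun | undefined {
  let latest: SyncRun | undefined;
  for (const run of runs) {
    if (run.source !== "offline_bundle") continue;
    if (!latest || Date.parse(run.finishedAt) > Date.parse(latest.finishedAt)) {
      latest = run;
    }
  }
  return latest;
}

// True when an offline_bundle run exists and finished within maxAgeHours of nowIso.
function offlineCatalogIsFresh(runs: SyncRun[], nowIso: string, maxAgeHours: number): boolean {
  const latest = latestOfflineRun(runs);
  if (!latest) return false;
  const ageHours = (Date.parse(nowIso) - Date.parse(latest.finishedAt)) / 3600000;
  return ageHours <= maxAgeHours;
}
```

Run the same check against rows from both model_catalog_sync_runs and tool_catalog_sync_runs.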

Provider errors increase

Inspect upstream provider status, gateway egress, provider credential state, request size limits, and timeout configuration. Avoid unsafe automatic retries for LLM generation because retries can double-spend or produce duplicate outputs.
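A conservative retry guard reflecting the caution above might look like the following. It is a sketch under stated assumptions, not the gateway's behavior: the FailedCall shape is hypothetical, and the rule that a generation call may be retried only when the request was never sent to the provider is one possible policy.

```typescript
// Illustrative sketch only: the shape and the policy itself are assumptions.
interface FailedCall {
  kind: "generation" | "read-only";
  requestSent: boolean; // true once any part of the request may have reached the provider
  attempt: number;      // 1 for the first attempt
}

// Decides whether an automatic retry is acceptable under this conservative policy.
function mayAutoRetry(call: FailedCall, maxAttempts: number): boolean {
  if (call.attempt >= maxAttempts) return false;
  if (call.kind === "generation") {
    // A generation that may have reached the provider could already be billed
    // or could yield a duplicate output, so it is never retried automatically.
    return !call.requestSent;
  }
  return true;
}
```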
