This page defines the operational checks for managed and on-prem Spendra deployments. In enterprise on-prem, the customer operates Kubernetes, Postgres, ingress/TLS, backups, secrets, observability, and SIEM forwarding.
Release validation
Before promoting a deployment, confirm the browser-facing build values for the target environment: `NEXT_PUBLIC_SUPABASE_URL`, `NEXT_PUBLIC_SUPABASE_ANON_KEY`, `NEXT_PUBLIC_API_BASE_URL`, and the optional browser Sentry DSN. Do not promote the same web image from integration to production, because those browser-facing values are compiled into the Next.js bundle.
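Because these values are fixed at build time, one option is a pre-build guard that fails when a required variable is missing. The sketch below is an illustration, not Spendra tooling: `missingPublicEnv` and the example values are hypothetical, and the Sentry DSN is left out of the required list because it is optional and its variable name is not given here.

```typescript
// Hypothetical pre-build guard; not part of Spendra itself.
// Browser-facing variables named on this page that must be set per environment.
const REQUIRED_PUBLIC_ENV = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "NEXT_PUBLIC_API_BASE_URL",
];

type BuildEnv = { [variable: string]: string | undefined };

// Returns the names of required variables that are unset or blank.
function missingPublicEnv(env: BuildEnv): string[] {
  return REQUIRED_PUBLIC_ENV.filter((variable) => {
    const value = env[variable];
    return value === undefined || value.trim() === "";
  });
}

// Example: an integration build that forgot the API base URL.
const missingVars = missingPublicEnv({
  NEXT_PUBLIC_SUPABASE_URL: "https://auth.integration.example.com",
  NEXT_PUBLIC_SUPABASE_ANON_KEY: "anon-key-placeholder",
});
// missingVars contains only "NEXT_PUBLIC_API_BASE_URL"
```

In a build script, the same function could be called with the process environment and the build aborted when the returned list is not empty.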
For database-backed environments, also run:
Use `pnpm db:seed` only for local or isolated demo environments. It creates demo Acme data and is not part of enterprise production bootstrap.
Migration checklist
Before applying migrations:
- Confirm the target application version and image tag.
- Confirm the latest successful Postgres backup and PITR window.
- Capture a pre-upgrade restore point or snapshot when the Postgres platform supports it.
- Review SQL migrations for destructive operations and long-running locks.
- Run migrations once through the Helm hook job or a controlled CI job, not from multiple terminals.
- Run `pnpm db:smoke` and application health checks after migration.
- Prefer forward migrations for normal fixes.
- Use `helm rollback` only for application revision rollback when the schema remains compatible.
- Use PITR/snapshot restore for destructive schema failures.
- Do not manually edit `schema_migrations` except under an incident procedure.
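The review step for destructive operations and long-running locks can be supported by a simple text scan before a human reads the migration. The sketch below is a heuristic under stated assumptions: the pattern list is illustrative and incomplete, `flagRiskyStatements` is a hypothetical helper, and the table names in the example are invented. A match means "review by hand", not "unsafe", and SQL comments or string literals can produce false positives.

```typescript
// Heuristic patterns only; extend the list to match your own review policy.
const RISKY_SQL_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "DROP TABLE", pattern: /\bdrop\s+table\b/i },
  { label: "DROP COLUMN", pattern: /\bdrop\s+column\b/i },
  { label: "TRUNCATE", pattern: /\btruncate\b/i },
  // In Postgres, CREATE INDEX without CONCURRENTLY blocks writes to the
  // table for the duration of the index build.
  {
    label: "CREATE INDEX without CONCURRENTLY",
    pattern: /\bcreate\s+(unique\s+)?index\b(?!\s+concurrently\b)/i,
  },
];

// Returns the labels of every pattern found in the migration text.
function flagRiskyStatements(sql: string): string[] {
  return RISKY_SQL_PATTERNS
    .filter((entry) => entry.pattern.test(sql))
    .map((entry) => entry.label);
}

// Example with invented table and column names.
const flagged = flagRiskyStatements(
  "ALTER TABLE budgets DROP COLUMN legacy_limit;\n" +
    "CREATE INDEX CONCURRENTLY idx_spend ON spend_events (created_at);"
);
// flagged contains only "DROP COLUMN"
```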
Backups and recovery
Postgres is the source of truth. Back up:
- Application tables.
- `schema_migrations` and Graphile Worker schema.
- Ledger entries, spend events, reservations, policies, budgets, and key metadata.
- Audit records, outbox state, rollups, catalog rows, and sync-run history.
| Environment | RPO | RTO | Notes |
|---|---|---|---|
| Pilot | 24 hours | 1 business day | Snapshot backup is usually enough. |
| Production | 15-60 minutes | 1-4 hours | Requires WAL archiving/PITR and rehearsed restore. |
| Critical gateway | 5-15 minutes | less than 1 hour | Requires HA Postgres, tested failover, and automation. |
To rehearse a restore:
- Restore Postgres into an isolated environment.
- Install the matching Spendra chart revision and image tag.
- Run `pnpm db:smoke`.
- Verify dashboard login, ledger reads, rollups, audit log reads, and a non-production governed gateway request.
- Document actual RPO/RTO and any manual steps.
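Once a drill has produced measured numbers, they can be compared against the targets in the table. The sketch below uses the upper bound of each range, in minutes. It is an illustration only: `evaluateDrill` and the field names are hypothetical, "1 business day" is assumed to mean 8 working hours, and "less than 1 hour" is approximated as at most 60 minutes.

```typescript
type Tier = "pilot" | "production" | "criticalGateway";

// Upper bounds from the RPO/RTO table, in minutes (see assumptions in the lead-in).
const TARGET_MINUTES: { [K in Tier]: { rpo: number; rto: number } } = {
  pilot: { rpo: 24 * 60, rto: 8 * 60 },
  production: { rpo: 60, rto: 4 * 60 },
  criticalGateway: { rpo: 15, rto: 60 },
};

interface DrillMeasurement {
  tier: Tier;
  // Gap between the last recoverable write and the simulated failure.
  dataLossMinutes: number;
  // Time from the simulated failure to a verified, working service.
  recoveryMinutes: number;
}

// Reports whether each measured value is within the target for its tier.
function evaluateDrill(drill: DrillMeasurement): { rpoMet: boolean; rtoMet: boolean } {
  const target = TARGET_MINUTES[drill.tier];
  return {
    rpoMet: drill.dataLossMinutes <= target.rpo,
    rtoMet: drill.recoveryMinutes <= target.rto,
  };
}

// Example: a production drill that lost 20 minutes of data and took 5 hours.
const drillOutcome = evaluateDrill({
  tier: "production",
  dataLossMinutes: 20,
  recoveryMinutes: 300,
});
// drillOutcome.rpoMet is true, drillOutcome.rtoMet is false (300 > 240)
```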
Monitoring
Track these signals:
- Gateway request rate and latency.
- Provider request latency and error rate.
- Reservation success, failure, and hard-cap block counts.
- Settlement failures.
- Missing or estimated usage count.
- Outbox lag and worker queue depth.
- Graphile queue lag.
- Rollup lag.
- Database query latency and connection pool utilization.
- API and worker exceptions.
- Catalog sync status and last successful sync/import time.
Use `/metrics` for private scraping with `SPENDRA_METRICS_BEARER_TOKEN`. Public ingress must deny `/metrics`.
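How a scraper presents the token is not spelled out here. The sketch below assumes it is sent as an `Authorization: Bearer` header, and it treats 401, 403, and 404 as acceptable denials from public ingress; both are assumptions. `buildMetricsRequest`, `publicIngressDeniesMetrics`, and the internal hostname are hypothetical.

```typescript
interface MetricsRequest {
  url: string;
  headers: { [header: string]: string };
}

// Assumption: the scraper presents the token as an HTTP Bearer credential.
function buildMetricsRequest(baseUrl: string, token: string): MetricsRequest {
  // Drop trailing slashes so the path is not doubled.
  const trimmedBase = baseUrl.replace(/\/+$/, "");
  return {
    url: trimmedBase + "/metrics",
    headers: { Authorization: "Bearer " + token },
  };
}

// Assumption: any of these status codes counts as "denied" for a public probe.
function publicIngressDeniesMetrics(publicStatusCode: number): boolean {
  return publicStatusCode === 401 || publicStatusCode === 403 || publicStatusCode === 404;
}

// Example with a hypothetical in-cluster address and a placeholder token.
const scrapeRequest = buildMetricsRequest("http://spendra-api.internal:8080/", "placeholder-token");
// scrapeRequest.url is "http://spendra-api.internal:8080/metrics"
```

The returned object can be passed to an HTTP client from inside the private network, while a separate probe against the public hostname feeds its status code to `publicIngressDeniesMetrics`.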
SIEM and audit export
Spendra services emit JSON logs through pino. Forward container stdout/stderr through the customer log pipeline to the SIEM. Preserve these fields when present:
- request ID, organization ID, actor ID, API key ID.
- provider, model or tool, route, status, error class.
- policy result, reservation ID, spend event ID, ledger entry ID.
- worker job type, outbox ID, catalog sync status.
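One way to confirm that a log pipeline preserves these fields is to compare a line as emitted with the same line as it arrives in the SIEM. The sketch below assumes both are single JSON objects. The key names in `PRESERVED_KEYS` are hypothetical stand-ins for a few of the fields listed above; replace them with the keys that actually appear in your Spendra logs.

```typescript
// Hypothetical JSON key names for a subset of the fields listed above.
const PRESERVED_KEYS = [
  "requestId",
  "organizationId",
  "actorId",
  "apiKeyId",
  "reservationId",
  "spendEventId",
  "ledgerEntryId",
];

type LogRecord = { [key: string]: unknown };

// Returns the preserved keys that were present in the emitted line
// but are absent from the forwarded line.
function droppedFields(emittedLine: string, forwardedLine: string): string[] {
  const emitted = JSON.parse(emittedLine) as LogRecord;
  const forwarded = JSON.parse(forwardedLine) as LogRecord;
  return PRESERVED_KEYS.filter((key) => key in emitted && !(key in forwarded));
}

// Example: the pipeline dropped the organization ID.
const dropped = droppedFields(
  '{"level":30,"requestId":"req_1","organizationId":"org_1","msg":"gateway request"}',
  '{"level":30,"requestId":"req_1","msg":"gateway request"}'
);
// dropped contains only "organizationId"
```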
Audit records are stored in the `audit_log` table. Export audit data through controlled database exports or future product export endpoints according to customer retention policy.
Troubleshooting
Gateway call is blocked
Check key status, expiration, actor binding, provider/model/tool scopes, hard-cap policies, and remaining budget. A blocked request should not produce provider spend or a ledger entry.
Ledger is missing a request
Confirm whether the request reached the provider, whether usage was available, whether settlement completed, and whether the worker processed downstream rollups. Check gateway request metadata before assuming finance data is missing.
Dashboard rollups are stale
Check worker health, Graphile queue depth, outbox lag, database connectivity, and recurring rollup job scheduling.
Auth redirects fail
Check dashboard public URL, Auth provider redirect URLs, browser public Auth configuration, and API session verification configuration.
Catalog is stale in an air-gapped environment
Confirm `catalog.onlineSync.enabled=false`, import the latest offline bundle with `pnpm catalog:import --input <file> --database-url <url>`, and verify `model_catalog_sync_runs` and `tool_catalog_sync_runs` include source `offline_bundle`.
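After an import, the sync-run check can be expressed as a small function over rows read from either sync-run table. The sketch below is an assumption-heavy illustration: only the source value `offline_bundle` comes from this page, while the row shape (`source`, `finishedAt`), the `online_sync` value, and the helper name `latestOfflineImport` are hypothetical. Timestamps are assumed to be ISO 8601 strings in UTC, which compare correctly as plain strings.

```typescript
// Hypothetical row shape; map it to the real sync-run table columns.
interface SyncRunRow {
  source: string;
  finishedAt: string; // ISO 8601 UTC timestamp
}

// Returns the most recent finish time among offline-bundle runs,
// or null when no such run exists.
function latestOfflineImport(rows: SyncRunRow[]): string | null {
  let latest: string | null = null;
  for (const row of rows) {
    if (row.source !== "offline_bundle") {
      continue;
    }
    if (latest === null || row.finishedAt > latest) {
      latest = row.finishedAt;
    }
  }
  return latest;
}

// Example rows; "online_sync" is an invented source value.
const lastImport = latestOfflineImport([
  { source: "online_sync", finishedAt: "2024-05-01T00:00:00Z" },
  { source: "offline_bundle", finishedAt: "2024-04-01T00:00:00Z" },
  { source: "offline_bundle", finishedAt: "2024-04-15T00:00:00Z" },
]);
// lastImport is "2024-04-15T00:00:00Z"
```

A `null` result means the tables contain no offline-bundle run, so the import did not record one.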