Operations
Runbooks, observability, backups, and incident response.
This page covers what operating Unphish v2 looks like day-to-day and during incidents.
Health checks and readiness
Every environment exposes readiness signals at /hub/readiness. The signals cover:
- Database. Connectivity, pool health, migration version vs. expected, recent error rate.
- Authentik. OIDC discovery reachable, admin token valid, recent auth failures within tolerance.
- Providers. Per-provider mode (`live`/`fixture`/`sandbox`/`unconfigured`) and last-call status.
- Worker. Render worker heartbeat, Temporal task-queue lag, recent retries and DLQ counts.
- Storage. S3 / Vercel Blob reachable; recent upload success.
- Queues and schedules. Verification schedule on time; report schedule on time.
A green readiness page is a precondition for promoting deployments. An amber/red signal blocks promotion until resolved or an explicit override (capability-gated, audited) is recorded.
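The roll-up logic can be sketched as follows. This is a minimal illustration, not the actual `/hub/readiness` implementation; the signal names and the `canPromote` helper are hypothetical.

```typescript
// Hypothetical sketch of the readiness roll-up: one red signal makes
// the page red, otherwise any amber makes it amber. Names mirror the
// tiles described above; the real schema may differ.
type Status = "green" | "amber" | "red";

interface ReadinessSignal {
  name: string;      // e.g. "database", "authentik", "providers"
  status: Status;
  detail?: string;   // e.g. "pool 80% utilized"
}

function overallReadiness(signals: ReadinessSignal[]): Status {
  if (signals.some((s) => s.status === "red")) return "red";
  if (signals.some((s) => s.status === "amber")) return "amber";
  return "green";
}

// An amber/red page blocks promotion unless a capability-gated,
// audited override is recorded.
function canPromote(signals: ReadinessSignal[], override = false): boolean {
  return overallReadiness(signals) === "green" || override;
}
```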
Observability
Application logs
- Structured JSON logs with subject IDs (user, organization, client, case) but no payload bodies.
- Errors include stack traces and request metadata.
- Access logs are aggregated at the Vercel layer; analyst-facing surfaces emit explicit events for significant state transitions.
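The "subject IDs but no payload bodies" rule can be illustrated with a small logging helper. The field names here are made up for the sketch; the actual log schema may differ.

```typescript
// Illustrative structured-log helper: serializes subject IDs (user,
// organization, client, case) and scalar metadata only. Request and
// response bodies never reach the log line; logs reference stored
// payloads by path instead. Field names are hypothetical.
interface LogSubjects {
  userId?: string;
  organizationId?: string;
  clientId?: string;
  caseId?: string;
}

function logEvent(
  event: string,
  subjects: LogSubjects,
  extra: Record<string, string | number> = {},
): string {
  const line = {
    ts: new Date().toISOString(),
    event,
    ...subjects,
    ...extra, // scalar metadata only, e.g. a storage path reference
  };
  return JSON.stringify(line);
}
```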
Worker logs
- Per-workflow run / step IDs for traceability.
- Activity start/finish events with duration and retry counts.
- Provider request/response references (the actual payloads are stored in S3 + Postgres; logs reference paths).
Metrics we watch
- Sign-in success rate per environment.
- Auth failure rate per environment (sustained spike = potential phishing campaign or Authentik issue).
- Provider call success rate per provider per environment.
- Workflow retry rate, failure rate, DLQ count.
- Verification schedule on-time percentage.
- Migration phase status (during cutover).
- API response latency by route family.
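The auth-failure metric above alerts on a *sustained* spike rather than a single bad minute. A minimal sliding-window check might look like this; the threshold and window are illustrative, not the production values.

```typescript
// Illustrative sustained-spike detector over per-minute auth failure
// rate samples (oldest first). A lone bad sample does not trigger;
// the rate must exceed the threshold for the whole window.
// Threshold and window values are made up for the sketch.
function sustainedSpike(
  failureRates: number[],
  threshold = 0.2,     // 20% of sign-in attempts failing
  windowMinutes = 5,   // must stay above threshold this long
): boolean {
  if (failureRates.length < windowMinutes) return false;
  const recent = failureRates.slice(-windowMinutes);
  return recent.every((r) => r > threshold);
}
```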
Alerting
Alerts fire to operations on-call for:
- Readiness red.
- Authentik admin token failure.
- Provider error rate over threshold for any `live` provider.
- Worker heartbeat loss.
- DLQ growth.
- Migration phase failure.
- Backup completion failure.
Backups and restore
| Component | Backup mechanism | Restore RPO/RTO target |
|---|---|---|
| Postgres (Neon) | Managed continuous backup with point-in-time recovery. | RPO ≤ 5 min; RTO ≤ 1 hour for a known good point in time. |
| Object storage (S3) | Versioning + lifecycle. No deletes by default. | RPO ≈ 0 for current versions; older versions per lifecycle. |
| Authentik | Self-managed Postgres backups for the Authentik host. Configuration as code where possible. | Tested per restore drill cadence. |
| Temporal Cloud | Managed by Temporal Cloud; retention per the chosen tier. | Per Temporal Cloud SLA. |
Restore procedures are documented in the production-readiness runbook (docs/production-readiness-runbook.md). They are exercised periodically to validate they actually work; an untested restore is not a backup.
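A piece of the monthly backup verification can be automated as a simple recovery-point check against the targets in the table. This is a sketch; the helper name is hypothetical and the real drill follows the runbook.

```typescript
// Illustrative RPO check: confirm the newest restorable point is
// within the target window (e.g. 5 minutes for Postgres per the
// table above). Function name is hypothetical.
function withinRpo(
  lastRestorablePoint: Date,
  now: Date,
  rpoMinutes: number,
): boolean {
  const ageMs = now.getTime() - lastRestorablePoint.getTime();
  return ageMs <= rpoMinutes * 60_000;
}
```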
Migration runbooks
For v1 → v2 migration:
- Audit before import. Run the read-only v1 audit (`pnpm audit:v1-dump`) and produce a migration report with row counts, distinct values, orphan checks, and file inventory.
- Stream high-volume tables. `unphish_whoisfreaks_nrd_domain` (1.45M rows) and `unphish_scan_query` (250K+ rows) require streaming with checkpoints and resumability.
- Phase the import. Phase 1 (clients/brands/users) before Phase 2 (cases/notes) before Phase 3 (enforcements) before Phase 4 (watchlist) before Phase 5 (whitelist/scans/quotas/API/SSO) before Phase 6 (validation).
- Preserve legacy IDs. Every imported record carries `legacy_v1_*` for traceability.
- Validate after each phase. Row counts match; required foreign keys resolve; enum values are normalized; orphans are quantified.
- Label imported data. Imported records carry `source: imported` and the UI labels them.
- Approve the migration report before promoting the imported data to production.
The migration is resumable. Failed batches roll back and re-run; checkpoints persist so we never re-import already-imported records.
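The checkpoint-and-resume pattern can be sketched as below. The hooks (`persistBatch`, `saveCheckpoint`) are hypothetical stand-ins for the real importer, which streams from v1 dumps into Postgres, and the real code is asynchronous.

```typescript
// Sketch of a checkpointed, resumable batch import. Resuming from the
// checkpoint means already-imported rows are never re-imported; a
// failed batch rolls back and re-runs from its offset. The hook names
// are illustrative, not the actual importer API.
interface Checkpoint {
  table: string;
  lastOffset: number; // rows already imported and committed
}

function streamImport(
  table: string,
  totalRows: number,
  batchSize: number,
  checkpoint: Checkpoint,
  persistBatch: (offset: number, limit: number) => void,
  saveCheckpoint: (cp: Checkpoint) => void,
): Checkpoint {
  let offset = checkpoint.lastOffset; // resume, don't restart
  while (offset < totalRows) {
    const limit = Math.min(batchSize, totalRows - offset);
    persistBatch(offset, limit);              // commit the batch
    offset += limit;
    saveCheckpoint({ table, lastOffset: offset }); // then advance the checkpoint
  }
  return { table, lastOffset: offset };
}
```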
Deployment
Pull request
Vercel creates a preview deployment for every PR. The preview is the closest reproduction of production we have for short-lived testing. Run the route smoke and the local parity smoke against the preview before merging.
Staging
Merging to main deploys to staging. Staging shares the same routes as production but uses staging environment variables, fixture/sandbox provider modes, and a staging database / branch.
The post-merge checklist is in Release gates → Promotion to staging.
Production
Production promotion is manual. The release manager:
- Confirms the staging build is healthy.
- Runs through the production gate checklist.
- Promotes via Vercel.
- Watches readiness, recent activity, and worker health for 30 minutes post-deploy.
- Verifies first real user sign-in succeeded.
- Documents the deployment in the activity log.
Rollback is documented per release; for a fast revert, redeploy the prior commit. For a migration that must be undone, follow the documented migration rollback (per phase).
Incident response
Severity
- Sev 1. Customer-impacting outage or data integrity issue. Examples: sign-in down, dashboard returns wrong tenant's data, enforcement submission fails platform-wide, audit log write failures. Page on-call immediately.
- Sev 2. Significant degradation. Examples: one provider down, one tenant cannot load a specific surface, scheduled report failed to deliver. On-call notified during business hours.
- Sev 3. Minor degradation or known-issue tracking. Examples: cosmetic UI bug, single non-critical workflow retry, intermittent provider warning. Filed and triaged.
First response
For Sev 1:
- Acknowledge in the incident channel within 5 minutes.
- Page any required additional responders.
- Capture initial state. Take screenshots; record current versions; note recent deploys.
- Mitigate. If a recent deploy is implicated, roll back. If a provider is implicated, switch to fixture/sandbox mode for that provider while investigating.
- Communicate. Update the status surface; notify affected customers if customer-impacting.
- Resolve. Address the root cause once stabilized.
- Postmortem. Within 5 business days for any Sev 1.
Common runbooks
| Symptom | First action |
|---|---|
| All sign-ins failing | Check Authentik readiness; check AUTH_AUTHENTIK_* env vars; check OIDC discovery reachable. |
| Dashboard shows "unavailable" everywhere | Check /hub/readiness → database tile; check for pool exhaustion; check for a recent migration. |
| One provider failing | Check /hub/secrets → provider last-check; rotate credential if compromised; switch to fixture mode while triaging. |
| Worker stopped processing | Check Render worker heartbeat; check Temporal Cloud namespace; restart the worker process. |
| Migration phase failed | Check /admin/imports → the run; review error; address root cause; resume from checkpoint. |
| Customer reports wrong-tenant data | Treat as Sev 1; check audit log; isolate the affected route; verify tenant scoping at the data-access layer. |
| Provider returning partial results or rejections at scale | Check classification confidence and evidence quality for affected cases; check provider-side policy changes; consider escalation. |
Post-incident
- Postmortem document with timeline, root cause, contributing factors, action items, and a parity matrix entry if behavior is now expected to be different.
- Action items tracked in the Delivery board with owners and dates.
- Customer communication for affected tenants, scoped to what they need to know.
Routine maintenance
| Task | Cadence |
|---|---|
| Restore drill (Postgres, evidence, Authentik) | Quarterly |
| Provider credential rotation | Per provider policy; minimum annually |
| Authentik admin token rotation | On Authentik secret_rotate events; minimum semi-annually |
| Capability/role audit | Quarterly |
| Audit log review (sample) | Monthly |
| Backup verification | Monthly |
| Dependency updates (security) | As security advisories warrant; minimum monthly |
Capacity planning
Inputs we track:
- Cases / day per tenant.
- Enforcement submissions / day per provider.
- Verification check load (4-hour cadence × active cases).
- Watchlist enrichment load (per-watchlist cadence × subscribed items).
- Scheduled report generation load.
- Migration backfill load (during cutover phases).
These feed Render worker count, Temporal task-queue concurrency, and database pool sizing. Increases trigger an environment configuration change rather than ad-hoc scaling.
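The load model is back-of-envelope arithmetic over those inputs. The verification figure follows directly from the 4-hour cadence in the text; the concurrency hint and its utilization target are illustrative assumptions, not the production sizing formula.

```typescript
// Back-of-envelope capacity model. The 4-hour verification cadence
// comes from the inputs above; the utilization target and the
// concurrency heuristic are illustrative assumptions.
function dailyVerificationChecks(activeCases: number): number {
  const checksPerCasePerDay = 24 / 4; // one check every 4 hours
  return activeCases * checksPerCasePerDay;
}

function workerConcurrencyHint(
  dailyActivities: number,
  avgActivitySeconds: number,
  utilizationTarget = 0.5, // keep headroom for spikes
): number {
  const busySecondsPerDay = dailyActivities * avgActivitySeconds;
  const secondsPerDay = 86_400;
  return Math.ceil(busySecondsPerDay / (secondsPerDay * utilizationTarget));
}
```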
Data subject requests
When a customer or end-user requests access, export, or deletion:
- Verify the requester through the customer's admin channel.
- Scope the request to the affected records.
- Execute via admin surfaces — export, soft-delete, or hard-delete with the documented two-person approval where required.
- Audit the action with reason and approver.
- Confirm completion to the requester.
Hard deletion is exceptional; default is soft deletion preserving audit history.
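The deletion policy can be expressed as a small authorization check. This is a sketch: the types are hypothetical, and the assumption that the requester cannot count as one of the two approvers is mine, not stated in the runbook.

```typescript
// Illustrative authorization check for data subject deletions.
// Soft delete is the default path (capability-gated and audited);
// hard delete requires two distinct approvers. Excluding the
// requester from the approver count is an assumption of this sketch.
interface DeletionRequest {
  recordId: string;
  mode: "soft" | "hard";
  requestedBy: string;
  approvers: string[];
  reason: string; // recorded in the audit log
}

function authorizeDeletion(req: DeletionRequest): boolean {
  if (req.mode === "soft") return true; // default path, audit preserved
  const distinct = new Set(req.approvers);
  distinct.delete(req.requestedBy); // requester cannot self-approve
  return distinct.size >= 2;        // two-person rule for hard delete
}
```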