Operations
Runbooks, observability, backups, and incident response.
This page covers what operating Unphish v2 looks like day-to-day and during incidents.
Health checks and readiness
Every environment exposes readiness signals at /hub/readiness. The signals cover:
- Database. Connectivity, pool health, migration version vs. expected, recent error rate.
- Authentik. OIDC discovery reachable, admin token valid, recent auth failures within tolerance.
- Providers. Per-provider mode (`live`/`fixture`/`sandbox`/`unconfigured`) and last-call status.
- Worker. Render worker heartbeat, Temporal task-queue lag, recent retries and DLQ counts.
- Storage. S3 / Vercel Blob reachable; recent upload success.
- Queues and schedules. Verification schedule on time; report schedule on time.
A green readiness page is a precondition for promoting deployments. An amber/red signal blocks promotion until resolved or an explicit override (capability-gated, audited) is recorded.
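The roll-up logic can be sketched as follows. This is a minimal illustration, not the actual `/hub/readiness` implementation; the signal names and the `canPromote` helper are hypothetical.

```typescript
// Hypothetical sketch of the readiness roll-up: one red signal makes
// the page red, otherwise any amber makes it amber. Names mirror the
// tiles described above; the real schema may differ.
type Status = "green" | "amber" | "red";

interface ReadinessSignal {
  name: string;      // e.g. "database", "authentik", "providers"
  status: Status;
  detail?: string;   // e.g. "pool 80% utilized"
}

function overallReadiness(signals: ReadinessSignal[]): Status {
  if (signals.some((s) => s.status === "red")) return "red";
  if (signals.some((s) => s.status === "amber")) return "amber";
  return "green";
}

// An amber/red page blocks promotion unless a capability-gated,
// audited override is recorded.
function canPromote(signals: ReadinessSignal[], override = false): boolean {
  return overallReadiness(signals) === "green" || override;
}
```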
Observability
Application logs
- Structured JSON logs with subject IDs (user, organization, client, case) but no payload bodies.
- Errors include stack traces and request metadata.
- Access logs are aggregated at the Vercel layer; analyst-facing surfaces emit explicit events for significant state transitions.
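The "subject IDs but no payload bodies" rule can be illustrated with a small logging helper. The field names here are made up for the sketch; the actual log schema may differ.

```typescript
// Illustrative structured-log helper: serializes subject IDs (user,
// organization, client, case) and scalar metadata only. Request and
// response bodies never reach the log line; logs reference stored
// payloads by path instead. Field names are hypothetical.
interface LogSubjects {
  userId?: string;
  organizationId?: string;
  clientId?: string;
  caseId?: string;
}

function logEvent(
  event: string,
  subjects: LogSubjects,
  extra: Record<string, string | number> = {},
): string {
  const line = {
    ts: new Date().toISOString(),
    event,
    ...subjects,
    ...extra, // scalar metadata only, e.g. a storage path reference
  };
  return JSON.stringify(line);
}
```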
Worker logs
- Per-workflow run / step IDs for traceability.
- Activity start/finish events with duration and retry counts.
- Provider request/response references (the actual payloads are stored in S3 + Postgres; logs reference paths).
Metrics we watch
- Sign-in success rate per environment.
- Auth failure rate per environment (sustained spike = potential phishing campaign or Authentik issue).
- Provider call success rate per provider per environment.
- Workflow retry rate, failure rate, DLQ count.
- Verification schedule on-time percentage.
- Migration phase status (during cutover).
- API response latency by route family.
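The auth-failure metric above alerts on a *sustained* spike rather than a single bad minute. A minimal sliding-window check might look like this; the threshold and window are illustrative, not the production values.

```typescript
// Illustrative sustained-spike detector over per-minute auth failure
// rate samples (oldest first). A lone bad sample does not trigger;
// the rate must exceed the threshold for the whole window.
// Threshold and window values are made up for the sketch.
function sustainedSpike(
  failureRates: number[],
  threshold = 0.2,     // 20% of sign-in attempts failing
  windowMinutes = 5,   // must stay above threshold this long
): boolean {
  if (failureRates.length < windowMinutes) return false;
  const recent = failureRates.slice(-windowMinutes);
  return recent.every((r) => r > threshold);
}
```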
Alerting
Alerts fire to operations on-call for:
- Readiness red.
- Authentik admin token failure.
- Provider error rate over threshold for any `live` provider.
- Worker heartbeat loss.
- DLQ growth.
- Migration phase failure.
- Backup completion failure.
Backups and restore
| Component | Backup mechanism | Restore RPO/RTO target |
|---|---|---|
| Postgres (Neon) | Managed continuous backup with point-in-time recovery. | RPO ≤ 5 min; RTO ≤ 1 hour for a known good point in time. |
| Object storage (S3) | Versioning + lifecycle. No deletes by default. | RPO ≈ 0 for current versions; older versions per lifecycle. |
| Authentik | Self-managed Postgres backups for the Authentik host. Configuration as code where possible. | Tested per restore drill cadence. |
| Temporal Cloud | Managed by Temporal Cloud; retention per the chosen tier. | Per Temporal Cloud SLA. |
Restore procedures are documented in the production-readiness runbook (docs/production-readiness-runbook.md). They are exercised periodically to validate they actually work; an untested restore is not a backup.
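A piece of the monthly backup verification can be automated as a simple recovery-point check against the targets in the table. This is a sketch; the helper name is hypothetical and the real drill follows the runbook.

```typescript
// Illustrative RPO check: confirm the newest restorable point is
// within the target window (e.g. 5 minutes for Postgres per the
// table above). Function name is hypothetical.
function withinRpo(
  lastRestorablePoint: Date,
  now: Date,
  rpoMinutes: number,
): boolean {
  const ageMs = now.getTime() - lastRestorablePoint.getTime();
  return ageMs <= rpoMinutes * 60_000;
}
```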
Migration runbooks
For v1 → v2 migration:
- Audit before import. Run the read-only v1 audit (`pnpm audit:v1-dump`) and produce a migration report with row counts, distinct values, orphan checks, and file inventory.
- Stream high-volume tables. `unphish_whoisfreaks_nrd_domain` (1.45M rows) and `unphish_scan_query` (250K+ rows) require streaming with checkpoints and resumability.
- Phase the import. Phase 1 (clients/brands/users) before Phase 2 (cases/notes) before Phase 3 (enforcements) before Phase 4 (watchlist) before Phase 5 (whitelist/scans/quotas/API/SSO) before Phase 6 (validation).
- Preserve legacy IDs. Every imported record carries `legacy_v1_*` for traceability.
- Validate after each phase. Row counts match; required foreign keys resolve; enum values are normalized; orphans are quantified.
- Label imported data. Imported records carry `source: imported` and the UI labels them.
- Approve the migration report before promoting the imported data to production.
The migration is resumable. Failed batches roll back and re-run; checkpoints persist so we never re-import already-imported records.
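The checkpoint-and-resume pattern can be sketched as below. The hooks (`persistBatch`, `saveCheckpoint`) are hypothetical stand-ins for the real importer, which streams from v1 dumps into Postgres, and the real code is asynchronous.

```typescript
// Sketch of a checkpointed, resumable batch import. Resuming from the
// checkpoint means already-imported rows are never re-imported; a
// failed batch rolls back and re-runs from its offset. The hook names
// are illustrative, not the actual importer API.
interface Checkpoint {
  table: string;
  lastOffset: number; // rows already imported and committed
}

function streamImport(
  table: string,
  totalRows: number,
  batchSize: number,
  checkpoint: Checkpoint,
  persistBatch: (offset: number, limit: number) => void,
  saveCheckpoint: (cp: Checkpoint) => void,
): Checkpoint {
  let offset = checkpoint.lastOffset; // resume, don't restart
  while (offset < totalRows) {
    const limit = Math.min(batchSize, totalRows - offset);
    persistBatch(offset, limit);              // commit the batch
    offset += limit;
    saveCheckpoint({ table, lastOffset: offset }); // then advance the checkpoint
  }
  return { table, lastOffset: offset };
}
```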
Deployment
Pull request
Vercel creates a preview deployment for every PR. The preview is the closest reproduction of production we have for short-lived testing. Run the route smoke and the local parity smoke against the preview before merging.
Staging
Merging to main deploys to staging. Staging shares the same routes as production but uses staging environment variables, fixture/sandbox provider modes, and a staging database / branch.
The post-merge checklist is in Release gates → Promotion to staging.
Production
Production promotion is manual. The release manager:
- Confirms the staging build is healthy.
- Runs through the production gate checklist.
- Promotes via Vercel.
- Watches readiness, recent activity, and worker health for 30 minutes post-deploy.
- Verifies first real user sign-in succeeded.
- Documents the deployment in the activity log.
Rollback is documented per release; for a fast revert, redeploy the prior commit. For a migration that must be undone, follow the documented migration rollback (per phase).
Incident response
Severity
- Sev 1. Customer-impacting outage or data integrity issue. Examples: sign-in down, dashboard returns wrong tenant's data, enforcement submission fails platform-wide, audit log write failures. Page on-call immediately.
- Sev 2. Significant degradation. Examples: one provider down, one tenant cannot load a specific surface, scheduled report failed to deliver. On-call notified during business hours.
- Sev 3. Minor degradation or known-issue tracking. Examples: cosmetic UI bug, single non-critical workflow retry, intermittent provider warning. Filed and triaged.
First response
For Sev 1:
- Acknowledge in the incident channel within 5 minutes.
- Page any required additional responders.
- Capture initial state. Take screenshots; record current versions; note recent deploys.
- Mitigate. If a recent deploy is implicated, roll back. If a provider is implicated, switch to fixture/sandbox mode for that provider while investigating.
- Communicate. Update the status surface; notify affected customers if customer-impacting.
- Resolve. Address the root cause once stabilized.
- Postmortem. Within 5 business days for any Sev 1.
Common runbooks
| Symptom | First action |
|---|---|
| All sign-ins failing | Check Authentik readiness; check AUTH_AUTHENTIK_* env vars; check OIDC discovery reachable. |
| Dashboard shows "unavailable" everywhere | Check /hub/readiness → database tile; check for pool exhaustion; check for a recent migration. |
| One provider failing | Check /hub/secrets → provider last-check; rotate credential if compromised; switch to fixture mode while triaging. |
| Worker stopped processing | Check Render worker heartbeat; check Temporal Cloud namespace; restart the worker process. |
| Migration phase failed | Check /admin/imports → the run; review error; address root cause; resume from checkpoint. |
| Customer reports wrong-tenant data | Treat as Sev 1; check audit log; isolate the affected route; verify tenant scoping at the data-access layer. |
| Provider returning partial results or rejections at scale | Check classification confidence and evidence quality for affected cases; check provider-side policy changes; consider escalation. |
Post-incident
- Postmortem document with timeline, root cause, contributing factors, action items, and a parity matrix entry if behavior is now expected to be different.
- Action items tracked in the Delivery board with owners and dates.
- Customer communication for affected tenants, scoped to what they need to know.
Routine maintenance
| Task | Cadence |
|---|---|
| Restore drill (Postgres, evidence, Authentik) | Quarterly |
| Provider credential rotation | Per provider policy; minimum annually |
| Authentik admin token rotation | On Authentik secret_rotate events; minimum semi-annually |
| Capability/role audit | Quarterly |
| Audit log review (sample) | Monthly |
| Backup verification | Monthly |
| Dependency updates (security) | As security advisories warrant; minimum monthly |
Capacity planning
Inputs we track:
- Cases / day per tenant.
- Enforcement submissions / day per provider.
- Verification check load (4-hour cadence × active cases).
- Watchlist enrichment load (per-watchlist cadence × subscribed items).
- Scheduled report generation load.
- Migration backfill load (during cutover phases).
These feed Render worker count, Temporal task-queue concurrency, and database pool sizing. Increases trigger an environment configuration change rather than ad-hoc scaling.
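The load model is back-of-envelope arithmetic over those inputs. The verification figure follows directly from the 4-hour cadence in the text; the concurrency hint and its utilization target are illustrative assumptions, not the production sizing formula.

```typescript
// Back-of-envelope capacity model. The 4-hour verification cadence
// comes from the inputs above; the utilization target and the
// concurrency heuristic are illustrative assumptions.
function dailyVerificationChecks(activeCases: number): number {
  const checksPerCasePerDay = 24 / 4; // one check every 4 hours
  return activeCases * checksPerCasePerDay;
}

function workerConcurrencyHint(
  dailyActivities: number,
  avgActivitySeconds: number,
  utilizationTarget = 0.5, // keep headroom for spikes
): number {
  const busySecondsPerDay = dailyActivities * avgActivitySeconds;
  const secondsPerDay = 86_400;
  return Math.ceil(busySecondsPerDay / (secondsPerDay * utilizationTarget));
}
```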
Data subject requests
When a customer or end-user requests access, export, or deletion:
- Verify the requester through the customer's admin channel.
- Scope the request to the affected records.
- Execute via admin surfaces — export, soft-delete, or hard-delete with the documented two-person approval where required.
- Audit the action with reason and approver.
- Confirm completion to the requester.
Hard deletion is exceptional; default is soft deletion preserving audit history.
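The deletion policy can be expressed as a small authorization check. This is a sketch: the types are hypothetical, and the assumption that the requester cannot count as one of the two approvers is mine, not stated in the runbook.

```typescript
// Illustrative authorization check for data subject deletions.
// Soft delete is the default path (capability-gated and audited);
// hard delete requires two distinct approvers. Excluding the
// requester from the approver count is an assumption of this sketch.
interface DeletionRequest {
  recordId: string;
  mode: "soft" | "hard";
  requestedBy: string;
  approvers: string[];
  reason: string; // recorded in the audit log
}

function authorizeDeletion(req: DeletionRequest): boolean {
  if (req.mode === "soft") return true; // default path, audit preserved
  const distinct = new Set(req.approvers);
  distinct.delete(req.requestedBy); // requester cannot self-approve
  return distinct.size >= 2;        // two-person rule for hard delete
}
```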