Galaxy production deploy runbook
engineering-docs-operations-deploy-runbook · in engineering/docs/operations · org-wide · updated 2026-06-01 10:19
Frontmatter
- lang: en
- imported_at: 2026-06-01T10:19:43.141Z
- source_path: productgalaxy/docs/operations/DEPLOY-RUNBOOK.md
- source_repo: productgalaxy
A single, plain-English walkthrough for taking productgalaxy live on a Hetzner VPS, written for a non-technical operator working with the Claude Code assistant.
Before you start
You need:
- A Hetzner Cloud account + one CCX23 (or larger) VPS provisioned with Ubuntu 24.04.
- SSH access to the VPS as root (later we lock this down).
- A domain (e.g. galaxy.example.com) with A/AAAA records pointed at the VPS IP. Three subdomains:
  - galaxy.example.com: admin + API
  - mcp.galaxy.example.com: MCP server
  - docs.galaxy.example.com: reading site
- A Backblaze B2 account + one bucket named galaxy-pgbackrest (encrypted backups).
- One generated production secrets file at ./secrets/galaxy.yaml.enc (SOPS-encrypted; see Step 2 below).
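Before moving on, it can help to confirm that all three subdomains already resolve to the VPS. The sketch below is one possible check, not part of the repo: `dns_matches` is a hypothetical helper name, and 203.0.113.10 in the usage comment is a placeholder address standing in for your VPS IP.

```shell
# Succeeds if EXPECTED_IP appears in the resolver output on stdin
# (one address per line, which is the shape `dig +short <host> A` prints).
dns_matches() {
  local expected_ip="$1"
  # -F: literal string, -x: whole-line match, -q: quiet
  grep -Fxq -- "$expected_ip"
}

# Usage (placeholder IP; replace with the real VPS address):
#   for host in galaxy.example.com mcp.galaxy.example.com docs.galaxy.example.com; do
#     dig +short "$host" A | dns_matches 203.0.113.10 || echo "DNS not ready: $host"
#   done
```

The whole-line match avoids a false positive where one address is a prefix of another (for example 203.0.113.1 versus 203.0.113.10).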
Step 1 — VPS bootstrap (one-off)
SSH into the box and run:
ssh root@<vps-ip>
apt update && apt -y upgrade
apt install -y ca-certificates curl gnupg lsb-release
# Docker Engine (from the official repo)
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
> /etc/apt/sources.list.d/docker.list
apt update
apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# A non-root user that owns the deploy directory
useradd --create-home --shell /bin/bash --groups docker galaxy
mkdir -p /srv/galaxy /etc/galaxy/secrets /var/lib/galaxy/attachments
chown -R galaxy:galaxy /srv/galaxy /var/lib/galaxy
chmod 700 /etc/galaxy/secrets
# Firewall — only SSH + HTTP(S) inbound
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw allow 443/udp # HTTP/3 over QUIC
ufw --force enable
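To double-check the firewall after enabling it, one option is to scan the `ufw status` output for any allowed port that this runbook did not open. This is a minimal sketch under the assumption that the output uses ufw's usual three-column layout (To / Action / From); `unexpected_ufw_rules` is a hypothetical helper name.

```shell
# Reads `ufw status` output on stdin and prints every ALLOW rule whose
# port/proto is not one of the four opened in Step 1.
# Prints nothing when the rule set matches the runbook.
unexpected_ufw_rules() {
  awk '
    BEGIN {
      split("22/tcp 80/tcp 443/tcp 443/udp", allowed, " ")
      for (i in allowed) ok[allowed[i]] = 1
    }
    # IPv4 rows: "<port>/<proto>  ALLOW  Anywhere"
    $2 == "ALLOW" && !($1 in ok) { print $1 }
    # IPv6 rows: "<port>/<proto> (v6)  ALLOW  Anywhere (v6)"
    $2 == "(v6)" && $3 == "ALLOW" && !($1 in ok) { print $1 " (v6)" }
  '
}

# Usage (as root on the VPS):
#   ufw status | unexpected_ufw_rules
```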
Step 2 — Push the secrets
Locally, encrypt your production secrets with SOPS + age (or a YubiKey, or a KMS). The example file .env.production.example shows every key you need to set.
# Locally:
sops --age <your-pubkey> --encrypt secrets/galaxy.yaml > secrets/galaxy.yaml.enc
# On first deploy, decrypt + push to the VPS (mode 600, owner root):
sops -d secrets/galaxy.yaml.enc | ssh root@<vps-ip> \
'umask 077 && cat > /etc/galaxy/secrets/galaxy.yaml && chmod 600 /etc/galaxy/secrets/galaxy.yaml'
The three pgBackRest secrets and the Postgres password are simpler one-line files:
echo "$B2_KEY" | ssh root@<vps-ip> 'umask 077 && cat > /etc/galaxy/secrets/b2_key.txt'
echo "$B2_SECRET" | ssh root@<vps-ip> 'umask 077 && cat > /etc/galaxy/secrets/b2_secret.txt'
openssl rand -base64 64 | ssh root@<vps-ip> 'umask 077 && cat > /etc/galaxy/secrets/pgbackrest_cipher_pass.txt'
echo "$POSTGRES_PASSWORD" | ssh root@<vps-ip> 'umask 077 && cat > /etc/galaxy/secrets/postgres_password.txt'
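The commands above rely on `umask 077` to leave each file at mode 600. A quick way to confirm that nothing ended up more permissive is sketched below; `loose_secret_files` is a hypothetical helper name, and the default directory is the one created in Step 1.

```shell
# Prints every regular file under DIR whose permission bits are not exactly 600.
# Prints nothing when all secrets are locked down as intended.
loose_secret_files() {
  local dir="${1:-/etc/galaxy/secrets}"
  find "$dir" -type f ! -perm 600 -print
}

# Usage (as root on the VPS):
#   loose_secret_files /etc/galaxy/secrets
```

An empty result means every secrets file is readable and writable by its owner only.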
Step 3 — Pin image digests (12-Factor II + V)
Edit docker-compose.prod.yml and replace each @sha256:REPLACE_WITH_DIGEST_BEFORE_FIRST_PROD_DEPLOY with a real digest. To get them:
docker pull pgvector/pgvector:pg17 && docker inspect pgvector/pgvector:pg17 --format '{{index .RepoDigests 0}}'
docker pull caddy:2.10-alpine && docker inspect caddy:2.10-alpine --format '{{index .RepoDigests 0}}'
docker pull pgbackrest/pgbackrest:2.55 && docker inspect pgbackrest/pgbackrest:2.55 --format '{{index .RepoDigests 0}}'
Galaxy's own images (ghcr.io/sabaidea/galaxy-app, …-mcp, …-docs) build from this repo. Push them via GitHub Actions on every tagged release; the workflow stamps the digest into a PROD_TAG env var that the compose file reads.
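Since a forgotten placeholder would only surface at deploy time, a small pre-flight check may be worth running after editing the file. The sketch below searches for the placeholder string quoted above; `digests_pinned` is a hypothetical helper name, not something the repo ships.

```shell
# Succeeds only if FILE is readable and contains no leftover digest placeholder.
# On failure, prints the offending lines (with line numbers) to stderr.
digests_pinned() {
  local file="${1:-docker-compose.prod.yml}"
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 2
  fi
  if grep -n 'REPLACE_WITH_DIGEST_BEFORE_FIRST_PROD_DEPLOY' "$file" >&2; then
    return 1
  fi
  return 0
}

# Usage:
#   digests_pinned docker-compose.prod.yml && echo "all digests pinned"
```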
Step 4 — First deploy
ssh galaxy@<vps-ip>
cd /srv/galaxy
git clone https://github.com/parhumm/productgalaxy.git .
# Build local images (first time only — subsequent deploys pull pre-built from GHCR)
docker compose -f docker-compose.yml -f docker-compose.prod.yml build
# Bring up postgres + pgbackrest only first
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d postgres pgbackrest
# Run migrations as a one-off process (12-Factor XII)
docker compose -f docker-compose.yml -f docker-compose.prod.yml run --rm app \
node packages/db/dist/migrate.js
# Seed the canonical taxonomies + first admin user
docker compose -f docker-compose.yml -f docker-compose.prod.yml run --rm app \
node packages/db/dist/seed/run.js
# Then bring up the rest
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
Wait 60 seconds for healthchecks to settle, then:
docker compose -f docker-compose.yml -f docker-compose.prod.yml ps
# Every row's STATUS should end in "(healthy)"
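Instead of a fixed 60-second wait, a polling loop can stop as soon as the services are healthy and fail loudly if they never get there. The following is a generic sketch: `wait_until` and `is_healthy` are hypothetical helper names, and `<container-name>` is a placeholder for a real container on the box.

```shell
# Retries COMMAND every INTERVAL seconds until it succeeds or TIMEOUT
# seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Usage (only meaningful for containers that define a healthcheck):
#   is_healthy() {
#     [ "$(docker inspect --format '{{.State.Health.Status}}' "$1")" = "healthy" ]
#   }
#   wait_until 120 5 is_healthy <container-name> || echo "still not healthy"
```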
Step 5 — Smoke-test from anywhere
Locally:
GALAXY_BASE=https://galaxy.example.com \
GALAXY_MCP_BASE=https://mcp.galaxy.example.com \
GALAXY_DOCS_BASE=https://docs.galaxy.example.com \
GALAXY_ADMIN_JWT="$(pnpm --filter @galaxy/db exec tsx tools/issue-admin-jwt.ts)" \
./scripts/smoke-test.sh
Expected: 12 green PASS lines, ending with ✓ smoke test passed.
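If the smoke test runs unattended, the expectation above can be checked mechanically. This sketch assumes that each passing check prints a line containing the word PASS and that the final line contains "smoke test passed"; the exact output format of scripts/smoke-test.sh is not shown here, so treat both patterns as assumptions. `smoke_output_ok` is a hypothetical helper name.

```shell
# Reads smoke-test output on stdin. Succeeds if it contains exactly
# EXPECTED lines with "PASS" (default 12) and its last line contains
# "smoke test passed".
smoke_output_ok() {
  local expected="${1:-12}" output pass_count
  output="$(cat)"
  pass_count="$(printf '%s\n' "$output" | grep -c 'PASS')"
  [ "$pass_count" -eq "$expected" ] &&
    printf '%s\n' "$output" | tail -n 1 | grep -q 'smoke test passed'
}

# Usage:
#   ./scripts/smoke-test.sh | smoke_output_ok 12 || echo "smoke test output looks wrong"
```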
Step 6 — Verify against live data
Once data is imported (see docs/operations/IMPORTERS.md):
# Per-domain data parity (byte equality vs source files)
pnpm --filter @galaxy/importer-audits run verify:data-parity
pnpm --filter @galaxy/importer-pm run verify:data-parity
pnpm --filter @galaxy/importer-abtests run verify:data-parity
pnpm --filter @galaxy/importer-comments run verify:data-parity
# Domain invariants
pnpm --filter @galaxy/importer-audits run verify:walkthrough-parity
pnpm --filter @galaxy/importer-audits run verify:content-parity
pnpm --filter @galaxy/importer-pm run verify:cross-link-parity
pnpm --filter @galaxy/importer-comments run verify:insights-parity
# Integration suite (runs against the live DB)
pnpm --filter @galaxy/tests-integration test
# E2E suite (requires Playwright browsers — one-off install on the deploy box)
pnpm --filter @galaxy/tests-e2e install:browsers
pnpm --filter @galaxy/tests-e2e test
A green run on all of the above is the cutover gate for the legacy app teams.
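Because the gate is "all of the above", it may be convenient to run the commands as one batch that keeps going after a failure and reports everything that broke. A minimal sketch, assuming each command fits on one line; `run_gate` is a hypothetical helper name.

```shell
# Runs each non-empty line read from stdin as a shell command, continues
# past failures, and returns non-zero if any command failed.
run_gate() {
  local failed=0 cmd
  while IFS= read -r cmd; do
    [ -z "$cmd" ] && continue
    echo "==> $cmd"
    # </dev/null stops the command from consuming the remaining lines
    if ! sh -c "$cmd" </dev/null; then
      echo "FAILED: $cmd" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage (first two commands from this step; append the rest the same way):
#   run_gate <<'EOF'
#   pnpm --filter @galaxy/importer-audits run verify:data-parity
#   pnpm --filter @galaxy/importer-pm run verify:data-parity
#   EOF
```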
Step 7 — Day-two operations
| Task | Command |
|---|---|
| Roll a new release | Push a tag → GHA builds + pushes images to GHCR → on the box, docker compose -f docker-compose.yml -f docker-compose.prod.yml pull && docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d |
| Restore from backup | See pgbackrest-restore.md — full / diff / time-PITR all documented |
| Rotate the admin JWT | pnpm --filter @galaxy/db exec tsx tools/issue-admin-jwt.ts > new.jwt, replace in secrets file, restart app + mcp |
| Read prod logs | docker compose -f docker-compose.yml -f docker-compose.prod.yml logs -f app mcp (12-Factor XI: stdout streams) |
| Health probe from a Slack alert | curl -fsS 'https://galaxy.example.com/healthz?deep=1' |
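Every compose command in this runbook needs the same pair of -f flags, which is easy to forget during day-two work. One way to reduce typing is a small shell function; `dc` is a hypothetical name, and the `DOCKER` variable is an assumed override hook (not something Docker reads) that allows a dry run.

```shell
# Wraps `docker compose` with both compose files used throughout this runbook.
# DOCKER defaults to the real docker binary; set DOCKER=echo for a dry run.
dc() {
  "${DOCKER:-docker}" compose \
    -f docker-compose.yml \
    -f docker-compose.prod.yml \
    "$@"
}

# Usage (from /srv/galaxy):
#   dc pull && dc up -d
#   dc logs -f app mcp
#   DOCKER=echo dc ps   # prints the command instead of running it
```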
Rollback
The deploy workflow keeps the previous image digest pinned in a tag file. Rollback is:
ssh galaxy@<vps-ip>
cd /srv/galaxy
git checkout HEAD~1 -- docker-compose.prod.yml # restore the previous digest pin
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
./scripts/smoke-test.sh # see Step 5 for the GALAXY_* variables it is run with
The DB never moves backward on rollback — migrations are append-only per CLAUDE.md §6. Restoring data is a separate pgBackRest PITR run, not a rollback.