Observability — OpenTelemetry + Grafana + Loki + Tempo
engineering-docs-operations-observability-setup · in engineering/docs/operations · org-wide · updated 2026-06-01 10:19
Frontmatter
- lang: en
- imported_at: 2026-06-01T10:19:43.258Z
- source_path: productgalaxy/docs/operations/observability-setup.md
- source_repo: productgalaxy
Self-hosted observability stack co-located on the same VPS. No external SaaS. ~600 MB RAM overhead total. Per CLAUDE.md §15 ops contract.
Stack at a glance
| Component | Role | Port |
|---|---|---|
| OTel SDK | Instruments Galaxy app + MCP server inside the process | (in-process) |
| OTel Collector | Sidecar that receives OTLP from the apps, fans out to Loki + Tempo + Prometheus | 4317 (gRPC), 4318 (HTTP) |
| Loki | Log aggregation (replaces stdout-only Docker logs) | 3100 |
| Tempo | Distributed traces (per-request waterfall + DB query timing) | 3200 |
| Prometheus | Metric storage | 9090 |
| Grafana | Dashboards + alerts | 3001 (internal only — not exposed externally) |
0. Add the stack to docker-compose.yml
  otel:
    image: otel/opentelemetry-collector-contrib:0.110.0
    container_name: galaxy_otel
    restart: unless-stopped
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    volumes:
      - ./docker/otel/config.yaml:/etc/otelcol-contrib/config.yaml:ro
    ports:
      - "127.0.0.1:4317:4317"   # gRPC, app talks to this
      - "127.0.0.1:4318:4318"   # HTTP fallback
  loki:
    image: grafana/loki:3.2.0
    container_name: galaxy_loki
    restart: unless-stopped
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./docker/loki/config.yaml:/etc/loki/config.yaml:ro
      - /data/loki:/loki
  tempo:
    image: grafana/tempo:2.6.0
    container_name: galaxy_tempo
    restart: unless-stopped
    command: -config.file=/etc/tempo.yaml
    volumes:
      - ./docker/tempo/config.yaml:/etc/tempo.yaml:ro
      - /data/tempo:/var/tempo
  prometheus:
    image: prom/prometheus:v3.0.0
    container_name: galaxy_prometheus
    restart: unless-stopped
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      # Required so /api/v1/write accepts the collector's prometheusremotewrite exporter.
      - '--web.enable-remote-write-receiver'
    volumes:
      - ./docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /data/prometheus:/prometheus
  grafana:
    image: grafana/grafana:11.4.0
    container_name: galaxy_grafana
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_INSTALL_PLUGINS: ""
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    ports:
      - "127.0.0.1:3001:3000"   # bind to loopback; reach via SSH tunnel only
    volumes:
      - /data/grafana:/var/lib/grafana
      - ./docker/grafana/provisioning:/etc/grafana/provisioning:ro
1. OTel Collector config
docker/otel/config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    tls:
      insecure: true
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
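The collector refuses to start when a pipeline names a component that is not declared in the matching top-level section, which is an easy mistake to make after renaming an exporter. As a rough sketch of that rule, the TypeScript below mirrors the config above as a plain object and lists any dangling references. The `CollectorConfig` type and the `danglingReferences` function are hypothetical names used for illustration; they are not part of the collector or of the Galaxy repo.

```typescript
// Illustrative only: a hand-written mirror of docker/otel/config.yaml.
type Section = 'receivers' | 'processors' | 'exporters';
type Pipeline = Record<Section, string[]>;

interface CollectorConfig {
  receivers: Record<string, unknown>;
  processors: Record<string, unknown>;
  exporters: Record<string, unknown>;
  service: { pipelines: Record<string, Pipeline> };
}

// Returns entries such as "logs.exporters: lokii" for every undeclared component.
function danglingReferences(cfg: CollectorConfig): string[] {
  const sections: Section[] = ['receivers', 'processors', 'exporters'];
  const problems: string[] = [];
  for (const [name, pipeline] of Object.entries(cfg.service.pipelines)) {
    for (const section of sections) {
      for (const component of pipeline[section]) {
        const declared = Object.prototype.hasOwnProperty.call(cfg[section], component);
        if (!declared) problems.push(`${name}.${section}: ${component}`);
      }
    }
  }
  return problems;
}

const galaxyConfig: CollectorConfig = {
  receivers: { otlp: {} },
  processors: { batch: {} },
  exporters: { loki: {}, 'otlp/tempo': {}, prometheusremotewrite: {} },
  service: {
    pipelines: {
      logs: { receivers: ['otlp'], processors: ['batch'], exporters: ['loki'] },
      traces: { receivers: ['otlp'], processors: ['batch'], exporters: ['otlp/tempo'] },
      metrics: { receivers: ['otlp'], processors: ['batch'], exporters: ['prometheusremotewrite'] },
    },
  },
};

console.log(danglingReferences(galaxyConfig)); // empty array: every reference resolves
```

This only models the naming rule. The collector binary's own `validate` subcommand remains the authoritative check for a real config file.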
2. Galaxy app — instrumentation
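Add to apps/app/src/api/index.ts, at the very top of the file. Auto-instrumentation patches modules as they are loaded, so the SDK must start before http, pg, or the router are imported, not merely before the route mounts: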
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

if (process.env.GALAXY_ENV !== 'local') {
  // Both exporters send OTLP over gRPC to the collector.
  const url = process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel:4317';
  const sdk = new NodeSDK({
    traceExporter: new OTLPTraceExporter({ url }),
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url }),
      exportIntervalMillis: 10_000,
    }),
    instrumentations: [getNodeAutoInstrumentations({
      // Postgres + HTTP are the high-value bits; skip noisy fs instrumentation.
      '@opentelemetry/instrumentation-fs': { enabled: false },
    })],
    serviceName: 'galaxy-app',
  });
  sdk.start();
  // Flush buffered spans and metrics before the container stops.
  process.on('SIGTERM', () => {
    sdk.shutdown().catch((err) => console.error('OTel shutdown failed', err));
  });
}
Same shape for apps/mcp/src/index.ts with serviceName: 'galaxy-mcp'.
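One subtlety in the endpoint fallback: `??` only falls back when the variable is undefined, so an `OTEL_EXPORTER_OTLP_ENDPOINT` that is set but empty (for example from a blank line in an env file) would be passed to the exporters unchanged. The helper below is a minimal sketch of a stricter lookup. `resolveOtlpEndpoint` is a hypothetical name, not something that exists in the Galaxy codebase, and the http/https check is an assumption about which URL forms the deployment uses.

```typescript
// Hypothetical helper; the default matches the collector service name in docker-compose.yml.
const DEFAULT_OTLP_ENDPOINT = 'http://otel:4317';

function resolveOtlpEndpoint(env: Record<string, string | undefined>): string {
  const raw = (env.OTEL_EXPORTER_OTLP_ENDPOINT ?? '').trim();
  // Treat unset, empty, and whitespace-only values the same way.
  if (raw === '') return DEFAULT_OTLP_ENDPOINT;
  // Fail at startup rather than silently exporting nowhere.
  if (!/^https?:\/\//.test(raw)) {
    throw new Error(`OTEL_EXPORTER_OTLP_ENDPOINT must start with http:// or https://, got "${raw}"`);
  }
  return raw;
}

console.log(resolveOtlpEndpoint({ OTEL_EXPORTER_OTLP_ENDPOINT: '' })); // prints http://otel:4317
```

In the app this would be called as `resolveOtlpEndpoint(process.env)` and the result passed as `url` to both exporters.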
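Add to package.json (sdk-metrics is imported directly by the snippet, so list it explicitly rather than relying on it as a transitive dependency):
"@opentelemetry/sdk-node": "^0.55.0",
"@opentelemetry/sdk-metrics": "^1.28.0",
"@opentelemetry/auto-instrumentations-node": "^0.55.0",
"@opentelemetry/exporter-trace-otlp-grpc": "^0.55.0",
"@opentelemetry/exporter-metrics-otlp-grpc": "^0.55.0"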
3. Grafana provisioning
docker/grafana/provisioning/datasources/galaxy.yaml:
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    access: proxy
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    access: proxy
  - name: Prometheus
    type: prometheus
    uid: prometheus   # fixed uid; the alert rules reference it as datasourceUid
    url: http://prometheus:9090
    access: proxy
    isDefault: true
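Pre-built dashboards: docker/grafana/provisioning/dashboards/galaxy/{app-latency.json,importer-status.json,audit-chain.json}. Commit the JSON exports there. Grafana loads them only through a dashboard provider file, a YAML file under docker/grafana/provisioning/dashboards/ that points at this directory.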
4. Alerts (Prometheus → Grafana)
docker/grafana/provisioning/alerting/rules.yaml:
apiVersion: 1
groups:
  - orgId: 1
    name: galaxy-prod
    folder: galaxy
    interval: 1m
    rules:
      - uid: galaxy-5xx-rate
        title: "5xx rate > 1% over 5m"
        condition: B
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange: { from: 300, to: 0 }
            model:
              expr: sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
              instant: true   # a threshold expression needs one value per series, not a range
              intervalMs: 60000
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions: [{ evaluator: { params: [0.01], type: gt } }]
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Galaxy is serving 5xx errors above the SLO. Investigate immediately."
      - uid: galaxy-p95-latency
        title: "p95 latency > 500ms over 5m"
        # ... same shape, histogram_quantile(0.95, ...) > 0.5
      - uid: galaxy-importer-failure
        title: "Importer hasn't run in 25h"
        # ... time() - max(galaxy_importer_last_run_unix_seconds) > 90000
Notify via email + webhook (Slack / Discord / Telegram).
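To make the three thresholds concrete, here is a small sketch of the arithmetic behind each rule, written as plain TypeScript functions. It is an approximation for reasoning about the numbers, not how Grafana or Prometheus evaluate the rules: the function names are invented for this example, the quantile routine is a simplified form of the bucket interpolation that `histogram_quantile` performs on classic histograms, and no-data handling (which Grafana controls per rule) is not modeled.

```typescript
// Thresholds copied from the rules above; all function names are illustrative.
const ERROR_RATIO_THRESHOLD = 0.01;            // 1% of requests
const P95_THRESHOLD_SECONDS = 0.5;             // 500 ms
const IMPORTER_MAX_AGE_SECONDS = 25 * 60 * 60; // 25 h = 90000 s

// sum(rate(5xx)) / sum(rate(all)) > 0.01
function errorRateBreached(fiveXxPerSec: number, totalPerSec: number): boolean {
  // 0 / 0 is NaN (no traffic) and NaN > 0.01 is false, so idle periods do not fire here.
  return fiveXxPerSec / totalPerSec > ERROR_RATIO_THRESHOLD;
}

// Cumulative bucket: `count` observations were <= `le` seconds (le = Infinity for +Inf).
interface Bucket { le: number; count: number }

// Linear interpolation inside the bucket that contains the q-th observation (0 < q <= 1).
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted.length > 0 ? sorted[sorted.length - 1].count : 0;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  const upper = sorted[i].le;
  // If the rank lands in the +Inf bucket, report the highest finite bound.
  if (upper === Infinity) return i > 0 ? sorted[i - 1].le : NaN;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  return lower + (upper - lower) * ((rank - below) / (sorted[i].count - below));
}

function p95Breached(buckets: Bucket[]): boolean {
  return histogramQuantile(0.95, buckets) > P95_THRESHOLD_SECONDS;
}

// time() - max(galaxy_importer_last_run_unix_seconds) > 90000
function importerStale(lastRunUnixSeconds: number, nowUnixSeconds: number): boolean {
  return nowUnixSeconds - lastRunUnixSeconds > IMPORTER_MAX_AGE_SECONDS;
}

// Made-up sample data: 100 requests, 95th falls in the (0.5, 1] bucket.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 }, { le: 0.25, count: 80 }, { le: 0.5, count: 90 },
  { le: 1, count: 98 }, { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, sampleBuckets)); // about 0.81 s, above the 0.5 s threshold
```

Note that the p95 estimate can never be finer than the bucket boundaries the app exports, so a 500 ms threshold is most meaningful when 0.5 is itself a bucket bound.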
5. Access Grafana from your laptop (no public exposure)
# On your laptop:
ssh -L 3001:localhost:3001 galaxy@<vps-ip>
open http://localhost:3001
# Login: admin / <GRAFANA_ADMIN_PASSWORD from /etc/galaxy/.env>
6. Verify the pipeline
# Generate a request:
curl -s http://localhost:3000/api/v1/openapi.json > /dev/null
# Check Loki saw the log line:
docker exec galaxy_loki wget -q -O - 'http://localhost:3100/loki/api/v1/labels'
# Check Tempo has a trace for that request (open Grafana → Explore → Tempo, search service=galaxy-app)
Troubleshooting
- No traces in Tempo: app didn't pick up OTEL_EXPORTER_OTLP_ENDPOINT. Set it in /etc/galaxy/.env and restart.
- High Loki disk usage: tune chunk_target_size + max_streams_per_user in docker/loki/config.yaml; default retention is 30d.
- Grafana resets on container recreate: missing /data/grafana volume mount.