
Observability — OpenTelemetry + Grafana + Loki + Tempo

engineering-docs-operations-observability-setup · in engineering/docs/operations · org-wide · updated 2026-06-01 10:19

Frontmatter

lang: en
imported_at: 2026-06-01T10:19:43.258Z
source_path: productgalaxy/docs/operations/observability-setup.md
source_repo: productgalaxy

Self-hosted observability stack co-located on the same VPS. No external SaaS. ~600 MB RAM overhead total. Per CLAUDE.md §15 ops contract.

Stack at a glance

| Component | Role | Port |
|---|---|---|
| OTel SDK | Instruments Galaxy app + MCP server inside the process | (in-process) |
| OTel Collector | Sidecar that receives OTLP from the apps, fans out to Loki + Tempo + Prometheus | 4317 (gRPC), 4318 (HTTP) |
| Loki | Log aggregation (replaces stdout-only Docker logs) | 3100 |
| Tempo | Distributed traces (per-request waterfall + DB query timing) | 3200 |
| Prometheus | Metric storage | 9090 |
| Grafana | Dashboards + alerts | 3001 (internal only, not exposed externally) |

0. Add the stack to docker-compose.yml

otel:
  image: otel/opentelemetry-collector-contrib:0.110.0
  container_name: galaxy_otel
  restart: unless-stopped
  command: ["--config=/etc/otelcol-contrib/config.yaml"]
  volumes:
    - ./docker/otel/config.yaml:/etc/otelcol-contrib/config.yaml:ro
  ports:
    - "127.0.0.1:4317:4317"   # gRPC, app talks to this
    - "127.0.0.1:4318:4318"   # HTTP fallback

loki:
  image: grafana/loki:3.2.0
  container_name: galaxy_loki
  restart: unless-stopped
  command: -config.file=/etc/loki/config.yaml
  volumes:
    - ./docker/loki/config.yaml:/etc/loki/config.yaml:ro
    - /data/loki:/loki

tempo:
  image: grafana/tempo:2.6.0
  container_name: galaxy_tempo
  restart: unless-stopped
  command: -config.file=/etc/tempo.yaml
  volumes:
    - ./docker/tempo/config.yaml:/etc/tempo.yaml:ro
    - /data/tempo:/var/tempo

prometheus:
  image: prom/prometheus:v3.0.0
  container_name: galaxy_prometheus
  restart: unless-stopped
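  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'
    # Required so the collector's prometheusremotewrite exporter can push to /api/v1/write;
    # without it Prometheus rejects remote-write requests.
    - '--web.enable-remote-write-receiver'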
  volumes:
    - ./docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    - /data/prometheus:/prometheus

grafana:
  image: grafana/grafana:11.4.0
  container_name: galaxy_grafana
  restart: unless-stopped
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
    GF_INSTALL_PLUGINS: ""
    GF_AUTH_ANONYMOUS_ENABLED: "false"
  ports:
    - "127.0.0.1:3001:3000"   # bind to loopback; reach via SSH tunnel only
  volumes:
    - /data/grafana:/var/lib/grafana
    - ./docker/grafana/provisioning:/etc/grafana/provisioning:ro

1. OTel Collector config

docker/otel/config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s

exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
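
To check the collector independently of the app, one option is to hand-build a single span and post it to the OTLP/HTTP receiver on port 4318. The sketch below assumes Node 18+ (for the global `fetch`) and the standard OTLP/JSON encoding served at `/v1/traces`. `buildTestSpan`, `sendTestSpan`, and the `collector-smoke-test` service name are hypothetical and not part of the Galaxy codebase.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: builds one OTLP/JSON payload containing a single span.
// OTLP/JSON encodes trace and span IDs as hex strings and timestamps as
// nanosecond strings, so the millisecond clock is padded with six zeros.
export function buildTestSpan(serviceName: string, nowMs: number = Date.now()) {
  const startMs = Math.trunc(nowMs);
  const endMs = startMs + 5; // pretend the span lasted 5 ms
  return {
    resourceSpans: [
      {
        resource: {
          attributes: [{ key: "service.name", value: { stringValue: serviceName } }],
        },
        scopeSpans: [
          {
            scope: { name: "smoke-test" },
            spans: [
              {
                traceId: randomBytes(16).toString("hex"), // 32 hex characters
                spanId: randomBytes(8).toString("hex"), // 16 hex characters
                name: "collector-smoke-test",
                kind: 1, // SPAN_KIND_INTERNAL
                startTimeUnixNano: `${startMs}000000`,
                endTimeUnixNano: `${endMs}000000`,
              },
            ],
          },
        ],
      },
    ],
  };
}

// Posts the payload to the collector's OTLP/HTTP receiver and returns the HTTP status.
// The default URL assumes this runs on the VPS host, where 4318 is bound to loopback.
export async function sendTestSpan(baseUrl: string = "http://127.0.0.1:4318"): Promise<number> {
  const res = await fetch(`${baseUrl}/v1/traces`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildTestSpan("collector-smoke-test")),
  });
  return res.status;
}
```

A 2xx status from `sendTestSpan()` suggests the receiver accepted the span; whether it then reaches Tempo depends on the traces pipeline above, so searching Tempo for `collector-smoke-test` afterwards covers the second hop.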

2. Galaxy app — instrumentation

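Add to apps/app/src/api/index.ts (near the top, before any route mounts, and ideally before the modules you want traced, such as http and pg, are first loaded, because auto-instrumentation patches modules at load time):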

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

if (process.env.GALAXY_ENV !== 'local') {
  const sdk = new NodeSDK({
    traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel:4317' }),
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://otel:4317' }),
      exportIntervalMillis: 10_000,
    }),
    instrumentations: [getNodeAutoInstrumentations({
      // Postgres + HTTP are the high-value bits; skip noisy fs instrumentation.
      '@opentelemetry/instrumentation-fs': { enabled: false },
    })],
    serviceName: 'galaxy-app',
  });
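  sdk.start();
  // Flush buffered spans and metrics on shutdown; log instead of leaving the rejection unhandled.
  process.on('SIGTERM', () => {
    sdk.shutdown().catch((err) => console.error('OTel shutdown failed', err));
  });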
}

Same shape for apps/mcp/src/index.ts with serviceName: 'galaxy-mcp'.
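
Since both services repeat the same gate and endpoint fallback, that logic could be pulled into a pure function that is unit-testable without starting the SDK. This is only a sketch: `resolveTelemetryConfig` and its return shape are hypothetical names, and only `GALAXY_ENV`, `OTEL_EXPORTER_OTLP_ENDPOINT`, and the `http://otel:4317` default come from the snippet above.

```typescript
// Hypothetical shape for the values the NodeSDK setup needs.
export interface TelemetryConfig {
  enabled: boolean;
  endpoint: string;
  serviceName: string;
}

const DEFAULT_ENDPOINT = "http://otel:4317";

// Mirrors the inline logic: telemetry is off when GALAXY_ENV is 'local',
// otherwise the endpoint falls back to the collector's Docker service name.
export function resolveTelemetryConfig(
  env: Record<string, string | undefined>,
  serviceName: string,
): TelemetryConfig {
  return {
    enabled: env["GALAXY_ENV"] !== "local",
    // `??` only falls back on undefined/null, matching the inline snippet;
    // an empty string in .env is passed through unchanged.
    endpoint: env["OTEL_EXPORTER_OTLP_ENDPOINT"] ?? DEFAULT_ENDPOINT,
    serviceName,
  };
}
```

Each entry point would then call `resolveTelemetryConfig(process.env, 'galaxy-app')` (or `'galaxy-mcp'`) and construct the `NodeSDK` only when `enabled` is true.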

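Add to package.json (@opentelemetry/sdk-metrics is imported directly above, so it is listed as well; keep it on the 1.x release that pairs with your sdk-node version):

"@opentelemetry/sdk-node": "^0.55.0",
"@opentelemetry/auto-instrumentations-node": "^0.55.0",
"@opentelemetry/exporter-trace-otlp-grpc": "^0.55.0",
"@opentelemetry/exporter-metrics-otlp-grpc": "^0.55.0",
"@opentelemetry/sdk-metrics": "^1.28.0"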

3. Grafana provisioning

docker/grafana/provisioning/datasources/galaxy.yaml:

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    access: proxy
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    access: proxy
  - name: Prometheus
    type: prometheus
    uid: prometheus   # fixed UID; the alert rules reference it via datasourceUid
    url: http://prometheus:9090
    access: proxy
    isDefault: true

Pre-built dashboards: docker/grafana/provisioning/dashboards/galaxy/{app-latency.json,importer-status.json,audit-chain.json} — drop in committed JSON exports.

4. Alerts (Prometheus → Grafana)

docker/grafana/provisioning/alerting/rules.yaml:

apiVersion: 1
groups:
  - orgId: 1
    name: galaxy-prod
    folder: galaxy
    interval: 1m
    rules:
      - uid: galaxy-5xx-rate
        title: "5xx rate > 1% over 5m"
        condition: B
        data:
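          - refId: A
            datasourceUid: prometheus
            relativeTimeRange: { from: 600, to: 0 }   # query window, in seconds before now
            model:
              expr: sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
              instant: true    # the threshold in B needs one value per series, not a time series
              range: false
              intervalMs: 60000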
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions: [{ evaluator: { params: [0.01], type: gt } }]
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Galaxy is serving 5xx errors above the SLO. Investigate immediately."

      - uid: galaxy-p95-latency
        title: "p95 latency > 500ms over 5m"
        # ... same shape, histogram_quantile(0.95, ...) > 0.5

      - uid: galaxy-importer-failure
        title: "Importer hasn't run in 25h"
        # ... time() - max(galaxy_importer_last_run_unix_seconds) > 90000

Notify via email + webhook (Slack / Discord / Telegram).
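
For reference, the thresholds in these rules reduce to simple arithmetic: 1% of requests, and 25 h expressed as 90000 s. The sketch below restates the first and third checks so they can be sanity-tested against sample numbers. The function names are hypothetical; they are not part of Grafana or the Galaxy codebase.

```typescript
const FIVE_XX_THRESHOLD = 0.01; // 1% of requests
const IMPORTER_MAX_AGE_S = 25 * 60 * 60; // 25 h = 90000 s

// Ratio of 5xx request rate to total request rate.
// Returns 0 when there is no traffic; the PromQL expression would instead
// yield NaN or no data in that case, which Grafana handles separately.
export function fiveXxRatio(errorRate: number, totalRate: number): number {
  return totalRate === 0 ? 0 : errorRate / totalRate;
}

// True when the 5xx ratio is strictly above the 1% threshold (the `gt` evaluator).
export function isFiveXxAlerting(errorRate: number, totalRate: number): boolean {
  return fiveXxRatio(errorRate, totalRate) > FIVE_XX_THRESHOLD;
}

// True when the importer's last run is more than 25 h old.
export function isImporterStale(nowUnixS: number, lastRunUnixS: number): boolean {
  return nowUnixS - lastRunUnixS > IMPORTER_MAX_AGE_S;
}
```

The 25 h window, rather than 24 h, leaves an hour of slack for a daily importer run that starts or finishes late.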

5. Access Grafana from your laptop (no public exposure)

# On your laptop:
ssh -L 3001:localhost:3001 galaxy@<vps-ip>
open http://localhost:3001   # macOS; on Linux use xdg-open
# Login: admin / <GRAFANA_ADMIN_PASSWORD from /etc/galaxy/.env>

6. Verify the pipeline

# Generate a request:
curl -s http://localhost:3000/api/v1/openapi.json > /dev/null

# Check Loki saw the log line:
docker exec galaxy_loki wget -q -O - 'http://localhost:3100/loki/api/v1/labels'

# Check Tempo has a trace for that request (open Grafana → Explore → Tempo, search service=galaxy-app)
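
The Loki labels call returns JSON of the form `{"status":"success","data":[...]}`. A small check like the following can turn that output into a pass/fail for scripting. It is a sketch: `checkLokiLabels` is a hypothetical name, and which label carries the service name (`service_name` in the usage note) depends on how the collector's Loki exporter maps attributes, so treat that label as an assumption.

```typescript
// Shape of the response from GET /loki/api/v1/labels.
// `data` may be missing when Loki has not ingested anything yet.
export interface LokiLabelsResponse {
  status: string;
  data?: string[];
}

// Returns true only if the response parsed, reported success,
// and lists the expected label name.
export function checkLokiLabels(rawJson: string, expectedLabel: string): boolean {
  let parsed: LokiLabelsResponse;
  try {
    parsed = JSON.parse(rawJson) as LokiLabelsResponse;
  } catch {
    return false; // not JSON, e.g. an empty body or an error page
  }
  return (
    parsed !== null &&
    parsed.status === "success" &&
    Array.isArray(parsed.data) &&
    parsed.data.indexOf(expectedLabel) !== -1
  );
}
```

Feeding it the output of the `wget` command above, for example `checkLokiLabels(output, 'service_name')`, gives `false` until at least one log line with that label has arrived.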

Troubleshooting

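  • No traces in Tempo: check that GALAXY_ENV is not 'local' (instrumentation is skipped in that case) and that OTEL_EXPORTER_OTLP_ENDPOINT, if set in /etc/galaxy/.env, points at the collector (the default is http://otel:4317). Restart the app after changing either.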
  • High Loki disk usage: tune chunk_target_size + max_streams_per_user in docker/loki/config.yaml; default retention is 30d.
  • Grafana resets on container recreate: missing /data/grafana volume mount.

Version history (1)

  • v1 · 2026-06-01 10:19 · "galaxy-docs importer: initial import"