Setting Up OpenTelemetry Observability for Backstage

A production Backstage instance is a multi-plugin backend whose slowest path — usually catalog ingestion or permission evaluation — is invisible without distributed tracing. This how-to wires OpenTelemetry into the Backstage backend, ships traces and metrics through an OTel Collector, builds catalog-latency dashboards, and routes alerts to PagerDuty and Slack. It extends the operational guidance in the Backstage Architecture Deep Dive within the broader Developer Portal Architecture & Frameworks strategy.

Prerequisites

  • Backstage on @backstage/backend-defaults 0.5.0+ (Node.js 20+).
  • An OpenTelemetry Collector 0.102.0+ reachable from the backend pods.
  • A metrics backend (Prometheus 2.52+) and a tracing backend (Tempo or Jaeger 1.57+).
  • Alertmanager 0.27+ with PagerDuty and Slack receivers configured.
  • Cluster permissions to deploy the Collector as a Deployment, sidecar, or DaemonSet.
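
# Requires Node.js >= 20
# Installs the packages imported by instrumentation.ts below
yarn --cwd packages/backend add \
  @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions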

Exact Configuration

  1. Initialize the SDK before the backend boots. Backstage exposes Prometheus-style metrics already; OpenTelemetry adds traces and lets you export both over OTLP. Create packages/backend/src/instrumentation.ts and import it first in index.ts.

    // packages/backend/src/instrumentation.ts
    // Requires @opentelemetry/sdk-node >= 0.200.0 (SDK 2.x); resourceFromAttributes is not available in earlier releases
    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
    import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
    import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
    import { resourceFromAttributes } from '@opentelemetry/resources';
    import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
    
    const sdk = new NodeSDK({
      resource: resourceFromAttributes({ [ATTR_SERVICE_NAME]: 'backstage-backend' }),
      traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
      metricReader: new PeriodicExportingMetricReader({
        exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
        exportIntervalMillis: 15000,
      }),
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();
    
    // packages/backend/src/index.ts — must be the FIRST import
    import './instrumentation';
    import { createBackend } from '@backstage/backend-defaults';
    const backend = createBackend();
    // ...register plugins
    backend.start();
    

    The auto-instrumentations cover Express, the PostgreSQL driver, and outbound HTTP — which is precisely the catalog-ingestion path you want visibility into.
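
Spans and metrics still buffered in the exporters can be lost when a pod is terminated before they are flushed. The following is a minimal sketch of a shutdown hook, not part of the original setup: `registerShutdown`, `Flushable`, and `SignalSource` are hypothetical names introduced here, and the only SDK behavior assumed is that `NodeSDK` exposes `shutdown(): Promise<void>`.

```typescript
// Hypothetical helper: flush telemetry once when the process is asked to stop.
// `Flushable` is the only slice of NodeSDK this sketch relies on.
interface Flushable {
  shutdown(): Promise<void>;
}

// Minimal slice of `process`, so the helper can be exercised with a fake.
interface SignalSource {
  once(signal: 'SIGTERM' | 'SIGINT', handler: () => void): unknown;
}

function registerShutdown(
  sdk: Flushable,
  proc: SignalSource,
  onDone: (err?: unknown) => void = () => {},
): void {
  let started = false;
  const flush = (): void => {
    if (started) return; // ignore a second signal while the flush is running
    started = true;
    sdk.shutdown().then(
      () => onDone(),
      err => onDone(err),
    );
  };
  proc.once('SIGTERM', flush);
  proc.once('SIGINT', flush);
}
```

In instrumentation.ts this could be called right after `sdk.start()`, for example `registerShutdown(sdk, process, () => process.exit(0))`.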

  2. Deploy the OTel Collector to receive OTLP, batch it, and fan out to Tempo and Prometheus.

    # otel-collector-config.yaml — requires otelcol >= 0.102.0
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
        timeout: 5s
    exporters:
      otlp/tempo:
        endpoint: ${env:TEMPO_ENDPOINT}
        tls: { insecure: false }
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
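
To make the effect of the `batch` processor's `timeout: 5s` easier to reason about, here is a simplified model of timeout-and-size batching. It is an illustration only, not the Collector's implementation: `batchByTimeout`, the `Timed` shape, and the `maxSize` cap are hypothetical, and the model starts its timer at the first item of each batch.

```typescript
// Simplified model of a batching stage. Items must be sorted by arrival time.
interface Timed<T> {
  atMs: number; // arrival time in milliseconds
  value: T;
}

function batchByTimeout<T>(items: Timed<T>[], timeoutMs: number, maxSize: number): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let openedAt = 0;
  for (const item of items) {
    // Flush the open batch if its timeout elapsed before this item arrived.
    if (current.length > 0 && item.atMs - openedAt >= timeoutMs) {
      batches.push(current);
      current = [];
    }
    if (current.length === 0) openedAt = item.atMs;
    current.push(item.value);
    // Flush early when the size cap is reached.
    if (current.length >= maxSize) {
      batches.push(current);
      current = [];
    }
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

With a 5000 ms timeout, spans arriving at 0, 1000, and 4000 ms travel together, while a span arriving at 6000 ms opens a new batch. A longer timeout means fewer, larger exports at the cost of added delay before traces appear in Tempo.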
    
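  3. Point the backend at the Collector via environment variables. The excerpt inlines the values for readability; in a real deployment, source them from a ConfigMap (or a Secret if the endpoint carries credentials) rather than hardcoding them.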

    # backend-deployment.yaml (excerpt)
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: "http://otel-collector.observability.svc:4317"
      - name: OTEL_SERVICE_NAME
        value: "backstage-backend"
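
A missing scheme in the endpoint value is easy to overlook and typically surfaces only as missing telemetry. As a sketch under stated assumptions, the check below validates the variable at startup. `resolveOtlpEndpoint` is a hypothetical helper, and the fallback `http://localhost:4317` is the default OTLP/gRPC endpoint from the OpenTelemetry specification.

```typescript
// Hypothetical startup check for the OTLP endpoint variable.
const DEFAULT_OTLP_GRPC_ENDPOINT = 'http://localhost:4317';

function resolveOtlpEndpoint(env: Record<string, string | undefined>): string {
  const raw = env.OTEL_EXPORTER_OTLP_ENDPOINT?.trim();
  if (!raw) return DEFAULT_OTLP_GRPC_ENDPOINT; // unset or empty: use the spec default

  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    throw new Error(`OTEL_EXPORTER_OTLP_ENDPOINT is not a valid URL: ${raw}`);
  }
  // "host:4317" without a scheme parses as a URL whose scheme is the host name,
  // so the protocol has to be checked explicitly.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`OTEL_EXPORTER_OTLP_ENDPOINT must start with http:// or https://, got: ${raw}`);
  }
  return raw;
}
```

Calling `resolveOtlpEndpoint(process.env)` at the top of instrumentation.ts would make a malformed value fail the pod at boot instead of silently dropping spans.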
    
  4. Build a catalog latency dashboard. Backstage emits catalog_stitching_duration_seconds and catalog_processing_duration_seconds. Chart the p95 of the processing duration by entity kind to surface slow processors.

    # Catalog processing p95 latency by entity kind
    histogram_quantile(
      0.95,
      sum(rate(catalog_processing_duration_seconds_bucket[5m])) by (le, kind)
    )
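
The p95 returned by `histogram_quantile` is an estimate interpolated from cumulative bucket counts, so its precision depends on the bucket boundaries. The sketch below reproduces that linear interpolation for a single series. `histogramQuantile` and the `Bucket` shape are hypothetical names, the bucket values are invented for illustration, and bucket bounds are assumed to be positive.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le`.
interface Bucket {
  le: number; // upper bound; the last bucket must be Infinity
  count: number; // cumulative count
}

// Estimate the q-quantile (0 < q <= 1) by linear interpolation within a bucket.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2 || sorted[sorted.length - 1].le !== Infinity) return NaN;

  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex(b => b.count >= rank);

  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  const upper = sorted[i].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  return lower + (upper - lower) * ((rank - below) / (sorted[i].count - below));
}

// Invented sample: 100 observations spread over five buckets.
const sample: Bucket[] = [
  { le: 0.5, count: 50 },
  { le: 1, count: 80 },
  { le: 5, count: 90 },
  { le: 10, count: 98 },
  { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, sample); // falls in the (5, 10] bucket
```

Here the 95th observation lands in the (5, 10] bucket, so the estimate is interpolated between 5 and 10 seconds. If the quantile falls in the +Inf bucket, the result is capped at the highest finite bound, which is why the bucket layout should extend past any alert threshold.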
    
  5. Alert to PagerDuty and Slack through Alertmanager. Page on sustained ingestion latency; notify Slack on elevated HTTP error rates. Align the on-call routing with your Team Permission Models so the right group owns each alert.

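    # alertmanager.yaml (requires Alertmanager >= 0.27.0)
    # Alertmanager does not expand environment variables itself: render the ${...} placeholders
    # with your deployment tooling, or switch to routing_key_file / api_url_file.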
    route:
      receiver: slack-platform
      routes:
        - matchers: [ 'severity="critical"' ]
          receiver: pagerduty-platform
    receivers:
      - name: pagerduty-platform
        pagerduty_configs:
          - routing_key: ${PAGERDUTY_ROUTING_KEY}
      - name: slack-platform
        slack_configs:
          - api_url: ${SLACK_WEBHOOK_URL}
            channel: '#platform-alerts'
    
    # prometheus-rules.yaml — page when catalog p95 exceeds 5s for 10m
    groups:
      - name: backstage
        rules:
          - alert: CatalogIngestionSlow
            expr: histogram_quantile(0.95, sum(rate(catalog_processing_duration_seconds_bucket[5m])) by (le)) > 5
            for: 10m
            labels: { severity: critical }
            annotations:
              summary: "Backstage catalog p95 ingestion latency above 5s"
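
To see why the `severity: critical` label decides who gets paged, it can help to model the routing tree. The sketch below is a simplified model of first-match routing, not Alertmanager's implementation: `pickReceiver` and the `Route` shape are hypothetical, only equality matchers are handled, and options such as `continue` are ignored.

```typescript
type Labels = Record<string, string>;

// Simplified routing node: equality matchers only.
interface Route {
  receiver: string;
  match?: Labels;
  routes?: Route[];
}

// Walk the tree depth-first; the first matching child wins,
// otherwise the alert stays with the current node's receiver.
function pickReceiver(node: Route, labels: Labels): string {
  for (const child of node.routes ?? []) {
    const matches = Object.entries(child.match ?? {}).every(
      ([key, value]) => labels[key] === value,
    );
    if (matches) return pickReceiver(child, labels);
  }
  return node.receiver;
}

// Mirrors the alertmanager.yaml route above.
const routingTree: Route = {
  receiver: 'slack-platform',
  routes: [{ receiver: 'pagerduty-platform', match: { severity: 'critical' } }],
};

const paged = pickReceiver(routingTree, {
  alertname: 'CatalogIngestionSlow',
  severity: 'critical',
});
```

An alert without the `severity` label, or with any other value, falls through to `slack-platform`, which is the usual explanation when an alert fires but nobody is paged.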
    

Validation

# 1. Confirm the backend exports spans (if nothing is logged, set OTEL_LOG_LEVEL=debug on the pod to enable SDK diagnostics)
kubectl logs deploy/backstage-backend -n backstage | grep -i "otel\|tracer"
# Expected: SDK diagnostic lines, no exporter connection errors

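# 2. Confirm the Collector is receiving and exporting
# The Collector serves its own otelcol_* telemetry on :8888 by default; :8889 is the Prometheus
# exporter for pipeline metrics. The Service must expose port 8888 for this port-forward to work.
kubectl port-forward svc/otel-collector 8888:8888 -n observability &
curl -s localhost:8888/metrics | grep otelcol_receiver_accepted_spans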
# Expected: a non-zero counter that increases under traffic

# 3. Confirm the catalog metric is scrapable
curl -s "http://prometheus:9090/api/v1/query?query=catalog_processing_duration_seconds_count" \
  | grep -o '"status":"success"'
# Expected: "status":"success"; also check that "result" is not empty, because a query for a missing metric still succeeds

# 4. Confirm an alert can fire end-to-end
amtool alert query alertname=CatalogIngestionSlow --alertmanager.url=http://alertmanager:9093
# Expected: the alert listed when latency breaches threshold

Healthy output: the backend logs the SDK start with no exporter errors, the Collector’s accepted-span counter rises under load, and the catalog metric query returns success.
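
A Prometheus instant query reports success even when the metric does not exist, so a stricter check should also look at the result array. The following sketch parses the response body from check 3. `metricIsPresent` is a hypothetical helper, and the response shape follows the Prometheus HTTP API (`status`, `data.result`).

```typescript
// Subset of the Prometheus HTTP API response for /api/v1/query.
interface PromQueryResponse {
  status: string;
  data?: { resultType?: string; result?: unknown[] };
}

// True only when the query succeeded AND returned at least one series.
function metricIsPresent(body: string): boolean {
  const parsed = JSON.parse(body) as PromQueryResponse;
  return parsed.status === 'success' && (parsed.data?.result?.length ?? 0) > 0;
}
```

Piping the `curl` output from check 3 into a small script that calls `metricIsPresent` distinguishes "Prometheus is reachable" from "the catalog metric is actually being scraped".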

Edge Cases & Troubleshooting

Symptom | Root Cause | Resolution
No spans reach the Collector | instrumentation.ts not imported first | Make import './instrumentation' the first line of index.ts
Metrics present, traces missing | Trace exporter endpoint wrong or TLS mismatch | Verify OTEL_EXPORTER_OTLP_ENDPOINT and Collector OTLP gRPC port 4317
Catalog metrics absent | Prometheus not scraping the Collector’s :8889 | Add a scrape job targeting the Collector’s Prometheus exporter
Alerts fire but no page | Alertmanager route not matching severity label | Confirm the rule sets severity: critical and the route matcher matches
Excessive trace volume / cost | Auto-instrumentation tracing every request | Add tail-based sampling in the Collector to keep error and slow traces
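
The tail-based sampling resolution keeps every error or slow trace and only a fraction of the rest. The decision logic can be sketched as below. This models the idea only and is not the Collector's tail-sampling processor: `keepTrace`, `TraceSummary`, and both thresholds are hypothetical.

```typescript
// Summary of a completed trace, as a tail sampler would see it.
interface TraceSummary {
  traceId: string; // 32 hex characters
  hasError: boolean;
  durationMs: number;
}

// Keep all error and slow traces; keep a deterministic fraction of the rest.
function keepTrace(trace: TraceSummary, slowMs: number, baselineRatio: number): boolean {
  if (trace.hasError) return true;
  if (trace.durationMs >= slowMs) return true;
  // Map the last 8 hex digits of the trace ID to [0, 1) so the same trace
  // always gets the same decision.
  const bucket = parseInt(trace.traceId.slice(-8), 16) / 0x100000000;
  return bucket < baselineRatio;
}
```

With `slowMs` set to 5000 and `baselineRatio` to 0.1, every failed or slow catalog request stays visible while roughly one in ten healthy traces is retained.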

Frequently Asked Questions

Does OpenTelemetry replace Backstage’s built-in Prometheus metrics?

No — it complements them. Backstage already exposes catalog and HTTP metrics in Prometheus format. OpenTelemetry adds distributed traces and a unified OTLP export path, so you can correlate a slow catalog metric with the exact span that caused it.

Where should I run the Collector — sidecar or central?

For a single-cluster Backstage, a central Collector Deployment behind a service is simplest and cheapest. Move to a per-node DaemonSet only when trace volume or network locality demands it; the instrumentation code is the same either way, and only the value of OTEL_EXPORTER_OTLP_ENDPOINT may need to change.