Setting Up OpenTelemetry Observability for Backstage
A production Backstage instance is a multi-plugin backend whose slowest path — usually catalog ingestion or permission evaluation — is invisible without distributed tracing. This how-to wires OpenTelemetry into the Backstage backend, ships traces and metrics through an OTel Collector, builds catalog-latency dashboards, and routes alerts to PagerDuty and Slack. It extends the operational guidance in the Backstage Architecture Deep Dive within the broader Developer Portal Architecture & Frameworks strategy.
Prerequisites
- Backstage on @backstage/backend-defaults 0.5.0+ (Node.js 20+).
- An OpenTelemetry Collector 0.102.0+ reachable from the backend pods.
- A metrics backend (Prometheus 2.52+) and a tracing backend (Tempo or Jaeger 1.57+).
- Alertmanager 0.27+ with PagerDuty and Slack receivers configured.
- Cluster permissions to deploy the Collector as a sidecar or DaemonSet.
# Requires Node.js >= 20
yarn --cwd packages/backend add \
@opentelemetry/[email protected] \
@opentelemetry/[email protected] \
@opentelemetry/[email protected] \
@opentelemetry/[email protected]
Exact Configuration
- Initialize the SDK before the backend boots. Backstage exposes Prometheus-style metrics already; OpenTelemetry adds traces and lets you export both over OTLP. Create packages/backend/src/instrumentation.ts and import it first in index.ts.

      // packages/backend/src/instrumentation.ts
      // Requires @opentelemetry/sdk-node >= 0.52.0
      import { NodeSDK } from '@opentelemetry/sdk-node';
      import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
      import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
      import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
      import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
      // Note: resourceFromAttributes ships with @opentelemetry/resources 2.x;
      // on 1.x releases, construct the resource with `new Resource({ ... })` instead.
      import { resourceFromAttributes } from '@opentelemetry/resources';
      import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

      const sdk = new NodeSDK({
        resource: resourceFromAttributes({ [ATTR_SERVICE_NAME]: 'backstage-backend' }),
        traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
        metricReader: new PeriodicExportingMetricReader({
          exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
          exportIntervalMillis: 15000, // push metrics every 15s
        }),
        instrumentations: [getNodeAutoInstrumentations()],
      });

      sdk.start();

      // packages/backend/src/index.ts: the instrumentation import must come FIRST
      import './instrumentation';
      import { createBackend } from '@backstage/backend-defaults';

      const backend = createBackend();
      // ...register plugins
      backend.start();

  The auto-instrumentations cover Express, the PostgreSQL driver, and outbound HTTP, which is precisely the catalog-ingestion path you want visibility into.

- Deploy the OTel Collector to receive OTLP, batch it, and fan out to Tempo and Prometheus.

      # otel-collector-config.yaml (requires otelcol >= 0.102.0)
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
      processors:
        batch:
          timeout: 5s
      exporters:
        otlp/tempo:
          endpoint: ${TEMPO_ENDPOINT}
          tls: { insecure: false }
        prometheus:
          endpoint: 0.0.0.0:8889
      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [batch]
            exporters: [otlp/tempo]
          metrics:
            receivers: [otlp]
            processors: [batch]
            exporters: [prometheus]

- Point the backend at the Collector through environment variables, ideally injected from a Secret/ConfigMap rather than hardcoded. The excerpt uses literal values for brevity.

      # backend-deployment.yaml (excerpt)
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector.observability.svc:4317"
        - name: OTEL_SERVICE_NAME
          value: "backstage-backend"

- Build a catalog latency dashboard. Backstage emits catalog_stitching_duration_seconds and catalog_processing_duration_seconds. Chart the p95 of the processing duration by entity kind to surface slow processors.

      # Catalog processing p95 latency by entity kind
      histogram_quantile(
        0.95,
        sum(rate(catalog_processing_duration_seconds_bucket[5m])) by (le, kind)
      )

- Alert to PagerDuty and Slack through Alertmanager. Page on sustained ingestion latency; notify Slack on elevated HTTP error rates. Align the on-call routing with your Team Permission Models so the right group owns each alert.

      # alertmanager.yaml (requires Alertmanager >= 0.27.0)
      # Alertmanager does not expand ${...} placeholders itself;
      # substitute them at deploy time before the file is loaded.
      route:
        receiver: slack-platform
        routes:
          - matchers: [ 'severity="critical"' ]
            receiver: pagerduty-platform
      receivers:
        - name: pagerduty-platform
          pagerduty_configs:
            - routing_key: ${PAGERDUTY_ROUTING_KEY}
        - name: slack-platform
          slack_configs:
            - api_url: ${SLACK_WEBHOOK_URL}
              channel: '#platform-alerts'

      # prometheus-rules.yaml: page when catalog p95 exceeds 5s for 10m
      groups:
        - name: backstage
          rules:
            - alert: CatalogIngestionSlow
              expr: histogram_quantile(0.95, sum(rate(catalog_processing_duration_seconds_bucket[5m])) by (le)) > 5
              for: 10m
              labels: { severity: critical }
              annotations:
                summary: "Backstage catalog p95 ingestion latency above 5s"
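To make the p95 query and the 5s alert threshold less opaque, the following is a minimal sketch of the linear interpolation that histogram_quantile applies to cumulative buckets. The bucket bounds and counts are made-up illustration values, not Backstage's actual bucket layout, and the function skips edge cases that Prometheus itself handles (non-monotonic buckets, native histograms).

```typescript
// One cumulative bucket: `count` observations took at most `le` seconds.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity stands in for "+Inf"
  count: number; // cumulative observation count
}

// Simplified quantile estimate over classic (cumulative) histogram buckets.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = sorted.findIndex(b => b.count >= rank);
  // A quantile that lands in the +Inf bucket is reported as the highest finite bound.
  if (sorted[idx].le === Infinity) return sorted[sorted.length - 2].le;

  const lowerBound = idx === 0 ? 0 : sorted[idx - 1].le;
  const lowerCount = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - lowerCount;
  if (inBucket === 0) return lowerBound;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (sorted[idx].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Illustrative values only: 100 processing runs spread over six buckets.
const sampleBuckets: Bucket[] = [
  { le: 0.5, count: 40 },
  { le: 1, count: 70 },
  { le: 2.5, count: 90 },
  { le: 5, count: 98 },
  { le: 10, count: 100 },
  { le: Infinity, count: 100 },
];

const p95 = histogramQuantile(0.95, sampleBuckets);
console.log(p95, p95 > 5 ? 'would breach the 5s threshold' : 'below the 5s threshold');
```

With these sample buckets the estimate lands at roughly 4.06 seconds, inside the 2.5s to 5s bucket. The result is only as precise as the bucket boundaries, so a threshold that sits exactly on a bucket edge (such as 5s here) is worth double-checking against the real bucket layout.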
Validation
# 1. Confirm the backend exports spans (look for the OTLP exporter on boot)
kubectl logs deploy/backstage-backend -n backstage | grep -i "otel\|tracer"
# Expected: SDK start log, no exporter connection errors
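# 2. Confirm the Collector is receiving and exporting
#    The otelcol_* self-metrics are served on the Collector's internal telemetry
#    port (8888 by default), not on the 8889 Prometheus exporter.
#    This assumes the Service exposes port 8888.
kubectl port-forward svc/otel-collector 8888:8888 -n observability &
curl -s localhost:8888/metrics | grep otelcol_receiver_accepted_spans
# Expected: a non-zero counter that increases under traffic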
# 3. Confirm the catalog metric is scrapable
curl -s "http://prometheus:9090/api/v1/query?query=catalog_processing_duration_seconds_count" \
| grep -o '"status":"success"'
# Expected: "status":"success"
# 4. Confirm an alert can fire end-to-end
amtool alert query alertname=CatalogIngestionSlow --alertmanager.url=http://alertmanager:9093
# Expected: the alert listed when latency breaches threshold
Healthy output: the backend logs the SDK start with no exporter errors, the Collector’s accepted-span counter rises under load, and the catalog metric query returns success.
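One caveat on check 3: the Prometheus query API reports "status":"success" even when no series matches, so the grep can pass while the catalog metric is actually missing. A stricter check parses the response body, as in this minimal sketch. It assumes the standard instant-query response shape; the helper name metricIsPresent is hypothetical.

```typescript
// Only the fields of a Prometheus instant-query response that this check reads.
interface PromQueryResponse {
  status: string;
  data?: {
    resultType: string;
    result?: Array<{ metric: Record<string, string>; value: [number, string] }>;
  };
}

// Hypothetical helper: true only when the query succeeded AND at least one
// returned series carries a sample value greater than zero.
function metricIsPresent(body: string): boolean {
  let parsed: PromQueryResponse;
  try {
    parsed = JSON.parse(body) as PromQueryResponse;
  } catch {
    return false; // not JSON at all, e.g. a proxy error page
  }
  if (parsed.status !== 'success') return false;
  const result = parsed.data?.result ?? [];
  return result.some(series => Number(series.value[1]) > 0);
}

// A successful query that matched nothing: the grep passes, this check does not.
const emptyBody = '{"status":"success","data":{"resultType":"vector","result":[]}}';
// A successful query with one series and a non-zero sample.
const populatedBody =
  '{"status":"success","data":{"resultType":"vector","result":[' +
  '{"metric":{"__name__":"catalog_processing_duration_seconds_count"},"value":[1700000000,"42"]}]}}';

console.log(metricIsPresent(emptyBody), metricIsPresent(populatedBody));
```

Feeding the helper the output of the curl command from check 3 distinguishes "Prometheus is up" from "Prometheus is up and the catalog metric has samples".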
Edge Cases & Troubleshooting
| Symptom | Root Cause | Resolution |
|---|---|---|
| No spans reach the Collector | instrumentation.ts not imported first | Make import './instrumentation' the first line of index.ts |
| Metrics present, traces missing | Trace exporter endpoint wrong or TLS mismatch | Verify OTEL_EXPORTER_OTLP_ENDPOINT and Collector OTLP gRPC port 4317 |
| Catalog metrics absent | Prometheus not scraping the Collector’s :8889 | Add a scrape job targeting the Collector’s Prometheus exporter |
| Alerts fire but no page | Alertmanager route not matching severity label | Confirm the rule sets severity: critical and the route matcher matches |
| Excessive trace volume / cost | Auto-instrumentation tracing every request | Add tail-based sampling in the Collector to keep error and slow traces |
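The tail-based sampling recommendation in the last row comes down to one decision per completed trace: keep it if it contains an error or is slow, otherwise keep only a small baseline share. The sketch below illustrates that decision in TypeScript. It is not how the Collector implements sampling, and the threshold, the baseline ratio, and all type and function names are assumptions chosen for illustration.

```typescript
// Minimal per-span facts needed for the decision.
interface SpanSummary {
  durationMs: number;
  isError: boolean;
}

// Illustrative policy knobs; tune both to your own traffic.
interface SamplingPolicy {
  slowThresholdMs: number; // spans at or above this duration count as slow
  baselineRatio: number;   // share of ordinary traces to keep, between 0 and 1
}

// Decide once the whole trace is available (hence "tail-based").
function keepTrace(traceId: string, spans: SpanSummary[], policy: SamplingPolicy): boolean {
  if (spans.some(s => s.isError)) return true;
  // Simplification: judges individual span durations, not end-to-end trace latency.
  if (spans.some(s => s.durationMs >= policy.slowThresholdMs)) return true;
  // Deterministic baseline: map the last 8 hex digits of the trace ID into [0, 1).
  const position = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return position < policy.baselineRatio;
}

const policy: SamplingPolicy = { slowThresholdMs: 2000, baselineRatio: 0.1 };
const fastOk: SpanSummary[] = [{ durationMs: 35, isError: false }];

console.log(
  keepTrace('4bf92f3577b34da6a3ce929d0e0e4736', [{ durationMs: 40, isError: true }], policy),
  keepTrace('4bf92f3577b34da6a3ce9299ffffffff', fastOk, policy),
);
```

Deriving the baseline decision from the trace ID rather than a random number keeps the outcome stable if the same trace is evaluated more than once.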
Frequently Asked Questions
Does OpenTelemetry replace Backstage’s built-in Prometheus metrics?
No — it complements them. Backstage already exposes catalog and HTTP metrics in Prometheus format. OpenTelemetry adds distributed traces and a unified OTLP export path, so you can correlate a slow catalog metric with the exact span that caused it.
Where should I run the Collector — sidecar or central?
For a single-cluster Backstage, a central Collector Deployment behind a service is simplest and cheapest. Move to a per-node DaemonSet only when trace volume or network locality demands it; the backend configuration is identical either way.