Setting Up OpenTelemetry Observability for Backstage

A production Backstage instance is a multi-plugin backend whose slowest path — usually catalog ingestion or permission evaluation — is invisible without distributed tracing. This how-to wires OpenTelemetry into the Backstage backend, ships traces and metrics through an OTel Collector, builds catalog-latency dashboards, and routes alerts to PagerDuty and Slack. It extends the operational guidance in the Backstage Architecture Deep Dive within the broader Developer Portal Architecture & Frameworks strategy.

Prerequisites

The pipeline you are about to build has a fixed shape: the backend emits OTLP to a Collector, which fans traces out to a tracing backend and metrics to Prometheus, with Alertmanager routing to PagerDuty and Slack. Everything below is a prerequisite for one hop in that chain.

Traces and metrics diverge at the Collector; alerting reads the metrics side, so both backends must be reachable.

Backstage on @backstage/backend-defaults 0.5.0+ (Node.js 20+).
An OpenTelemetry Collector 0.102.0+ reachable from the backend pods.
A metrics backend (Prometheus 2.52+) and a tracing backend (Tempo or Jaeger 1.57+).
Alertmanager 0.27+ with PagerDuty and Slack receivers configured.
Cluster permissions to deploy the Collector as a sidecar or DaemonSet.

# Requires Node.js >= 20
yarn --cwd packages/backend add \
  @opentelemetry/[email protected] \
  @opentelemetry/[email protected] \
  @opentelemetry/[email protected] \
  @opentelemetry/[email protected]

Exact Configuration

The configuration proceeds in five moves, each enabling the next: instrument the backend, stand up the Collector, point the backend at it, chart the catalog metric, and finally wire the alert. The diagram sequences them so you can see where each config file fits.

The SDK import must come first; every later step depends on the backend already emitting telemetry.

Initialize the SDK before the backend boots. Backstage exposes Prometheus-style metrics already; OpenTelemetry adds traces and lets you export both over OTLP. Create packages/backend/src/instrumentation.ts and import it first in index.ts.

// packages/backend/src/instrumentation.ts
// Requires @opentelemetry/sdk-node >= 0.52.0
// Note: resourceFromAttributes ships with @opentelemetry/resources 2.x. On the 1.x
// line that pairs with sdk-node 0.52, use `new Resource({ ... })` instead.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { resourceFromAttributes } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: resourceFromAttributes({ [ATTR_SERVICE_NAME]: 'backstage-backend' }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// packages/backend/src/index.ts — must be the FIRST import
import './instrumentation';
import { createBackend } from '@backstage/backend-defaults';
const backend = createBackend();
// ...register plugins
backend.start();

The auto-instrumentations cover Express, the PostgreSQL driver, and outbound HTTP — which is precisely the catalog-ingestion path you want visibility into.
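
The import-order requirement follows from how patch-based instrumentation behaves in general: a wrapper only sees calls made through the patched reference. The toy below is an analogy under that assumption, not OpenTelemetry's actual loader hook; `lib`, `early`, `late`, and `traced` are hypothetical names.

```typescript
// Toy stand-in for a library module (hypothetical; not a real OpenTelemetry API).
type RequestFn = (path: string) => string;
const lib: { request: RequestFn } = {
  request: (path) => "GET " + path,
};

// Paths seen by the wrapper play the role of recorded spans.
const traced: string[] = [];

// A consumer that grabbed the function BEFORE instrumentation ran.
const early: RequestFn = lib.request;

// "Instrumentation": wrap the function so each call is recorded.
const original: RequestFn = lib.request;
lib.request = (path) => {
  traced.push(path);
  return original(path);
};

// A consumer that resolved the function AFTER instrumentation ran.
const late: RequestFn = lib.request;

early("/catalog/entities"); // bypasses the wrapper, so nothing is recorded
late("/catalog/locations"); // goes through the wrapper and is recorded
```

Only the second call lands in `traced`. In the same way, anything the backend loads before `instrumentation.ts` runs may hold unpatched references and stay invisible to tracing.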

Deploy the OTel Collector to receive OTLP, batch it, and fan out to Tempo and Prometheus.

# otel-collector-config.yaml — requires otelcol >= 0.102.0
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 5s
exporters:
  otlp/tempo:
    endpoint: ${env:TEMPO_ENDPOINT}
    tls: { insecure: false }
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
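
To make the `batch` processor's role concrete, here is a minimal sketch of the batching idea: buffer items, then flush when either a timeout elapses or a size cap is hit. It is an illustration only; the Collector's processor is written in Go, and the `Batcher` class, its `maxSize` parameter, and the explicit clock argument are assumptions made for this example.

```typescript
// Minimal batching buffer (illustrative; not the Collector's implementation).
class Batcher<T> {
  private buffer: T[] = [];
  private oldestAt: number | null = null;
  private readonly timeoutMs: number;
  private readonly maxSize: number;
  private readonly send: (batch: T[]) => void;

  constructor(timeoutMs: number, maxSize: number, send: (batch: T[]) => void) {
    this.timeoutMs = timeoutMs; // mirrors `timeout: 5s` in the config above
    this.maxSize = maxSize;     // hypothetical size cap for this sketch
    this.send = send;
  }

  // Buffer an item at time `nowMs`; flush at once when the size cap is reached.
  add(item: T, nowMs: number): void {
    if (this.oldestAt === null) this.oldestAt = nowMs;
    this.buffer.push(item);
    if (this.buffer.length >= this.maxSize) this.flush();
  }

  // Called periodically; flushes once the oldest buffered item has waited long enough.
  tick(nowMs: number): void {
    if (this.oldestAt !== null && nowMs - this.oldestAt >= this.timeoutMs) this.flush();
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    this.send(this.buffer);
    this.buffer = [];
    this.oldestAt = null;
  }
}

const sent: string[][] = [];
const batcher = new Batcher<string>(5000, 3, (batch) => {
  sent.push(batch);
});
batcher.add("span-1", 0);
batcher.add("span-2", 1000);
batcher.tick(4000); // 4s since the first item: nothing is sent yet
batcher.tick(5000); // 5s reached: both spans leave in one batch
```

Batching trades a few seconds of delay for fewer, larger export calls to Tempo and Prometheus.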

Point the backend at the Collector through environment variables. The excerpt shows literal values for readability; in a real deployment, inject them from a ConfigMap or Secret instead of hardcoding them.

# backend-deployment.yaml (excerpt)
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.observability.svc:4317"
  - name: OTEL_SERVICE_NAME
    value: "backstage-backend"

Build a catalog latency dashboard. Backstage emits catalog_stitching_duration_seconds and catalog_processing_duration_seconds. Chart the p95 of the processing duration by entity kind to surface slow processors.
```
# Catalog processing p95 latency by entity kind
histogram_quantile(
  0.95,
  sum(rate(catalog_processing_duration_seconds_bucket[5m])) by (le, kind)
)
```
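
For readers unsure what `histogram_quantile` computes, the sketch below approximates its behaviour for classic cumulative buckets: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified model that assumes non-negative observations, a final `+Inf` bucket, and `0 < q <= 1`; the bucket values are made up for illustration.

```typescript
// One cumulative bucket: `count` observations were less than or equal to `le` seconds.
interface Bucket {
  le: number;
  count: number;
}

// Simplified model of histogram_quantile for classic histograms.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const last = sorted.length - 1;
  const total = sorted[last].count;
  if (total === 0) return NaN;

  // Rank of the requested quantile among all observations.
  const rank = q * total;

  // First bucket whose cumulative count reaches the rank.
  let i = 0;
  while (i < last && sorted[i].count < rank) i++;

  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (i === last) return sorted[last - 1].le;

  // Linear interpolation between the bucket's lower and upper bounds.
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up bucket counts for a single entity kind.
const p95 = histogramQuantile(0.95, [
  { le: 0.5, count: 10 },
  { le: 1, count: 60 },
  { le: 5, count: 98 },
  { le: Infinity, count: 100 },
]);
```

With these sample counts the p95 lands inside the 1s to 5s bucket, at roughly 4.68 seconds. The result is an estimate whose precision depends on how finely the bucket boundaries are spaced around the threshold you care about.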

Alert to PagerDuty and Slack through Alertmanager. Page on sustained ingestion latency; notify Slack on elevated HTTP error rates. Align the on-call routing with your Team Permission Models so the right group owns each alert.

# alertmanager.yaml — requires Alertmanager >= 0.27.0
# Alertmanager does not expand environment variables in its config file. Render the
# ${...} placeholders with a templating step before loading, or switch to
# routing_key_file / api_url_file and mount the secrets as files.
route:
  receiver: slack-platform
  routes:
    - matchers: [ 'severity="critical"' ]
      receiver: pagerduty-platform
receivers:
  - name: pagerduty-platform
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_ROUTING_KEY}
  - name: slack-platform
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#platform-alerts'
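
The routing tree above can be read as "first matching child route wins, otherwise fall back to the root receiver". The sketch below models only that rule with equality matchers; `pickReceiver` and its types are hypothetical, and real Alertmanager routing also supports regex matchers, nesting, and `continue`.

```typescript
type Labels = { [name: string]: string };

interface ChildRoute {
  receiver: string;
  match: Labels; // equality matchers only in this sketch
}

// Return the receiver of the first child route whose matchers all hold,
// falling back to the root receiver when none match.
function pickReceiver(labels: Labels, rootReceiver: string, routes: ChildRoute[]): string {
  for (const route of routes) {
    const allMatch = Object.keys(route.match).every(
      (name) => labels[name] === route.match[name],
    );
    if (allMatch) return route.receiver;
  }
  return rootReceiver;
}

// Mirrors the route block above.
const childRoutes: ChildRoute[] = [
  { receiver: "pagerduty-platform", match: { severity: "critical" } },
];

const paged = pickReceiver(
  { alertname: "CatalogIngestionSlow", severity: "critical" },
  "slack-platform",
  childRoutes,
);
const notified = pickReceiver(
  { alertname: "HttpErrorRateHigh", severity: "warning" }, // hypothetical alert name
  "slack-platform",
  childRoutes,
);
```

Here `paged` resolves to `pagerduty-platform` and `notified` to `slack-platform`, which is why every paging rule has to carry the `severity: critical` label.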

# prometheus-rules.yaml — page when catalog p95 exceeds 5s for 10m
groups:
  - name: backstage
    rules:
      - alert: CatalogIngestionSlow
        expr: histogram_quantile(0.95, sum(rate(catalog_processing_duration_seconds_bucket[5m])) by (le)) > 5
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Backstage catalog p95 ingestion latency above 5s"

Validation

Validate the pipeline hop by hop: the backend exports, the Collector receives, Prometheus scrapes, and an alert can fire. Checking them in order localizes any break to a single link in the chain.

A broken hop shows exactly where telemetry stops flowing, so you never debug the whole pipeline at once.

# 1. Confirm the backend exports spans (look for the OTLP exporter on boot)
kubectl logs deploy/backstage-backend -n backstage | grep -i "otel\|tracer"
# Expected: SDK start log, no exporter connection errors.
# If nothing matches, set OTEL_LOG_LEVEL=debug on the backend; the SDK is quiet by default.

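# 2. Confirm the Collector is receiving and exporting.
#    The otelcol_* self-metrics are served on the Collector's internal telemetry port
#    (8888 by default), not on the Prometheus exporter at 8889. The Service must expose it.
kubectl port-forward svc/otel-collector 8888:8888 -n observability &
curl -s localhost:8888/metrics | grep otelcol_receiver_accepted_spans
# Expected: a non-zero counter that increases under traffic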

# 3. Confirm the catalog metric is scrapable
curl -s "http://prometheus:9090/api/v1/query?query=catalog_processing_duration_seconds_count" \
  | jq '.status, (.data.result | length)'
# Expected: "success" and a result count above 0 (an empty result also reports success)

# 4. Confirm an alert can fire end-to-end
amtool alert query alertname=CatalogIngestionSlow --alertmanager.url=http://alertmanager:9093
# Expected: the alert listed when latency breaches threshold

Healthy output: the backend logs the SDK start with no exporter errors, the Collector’s accepted-span counter rises under load, and the catalog metric query returns success.
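
If you want to script the Collector check rather than eyeball `grep` output, a small parser over the Prometheus text format is enough. The sketch below sums every sample of one metric; `sumMetric` is a hypothetical helper, and the exposition text is a hand-written sample, not captured Collector output.

```typescript
// Sum all samples of `metricName` in Prometheus text exposition format.
function sumMetric(exposition: string, metricName: string): number {
  let total = 0;
  for (const line of exposition.split("\n")) {
    const trimmed = line.trim();
    // Skip blank lines and # HELP / # TYPE comments.
    if (trimmed === "" || trimmed.charAt(0) === "#") continue;
    // Sample lines look like: name{labels} value [timestamp]
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/.exec(trimmed);
    if (match !== null && match[1] === metricName) total += Number(match[3]);
  }
  return total;
}

// Hand-written sample for illustration only.
const sampleExposition = [
  "# TYPE otelcol_receiver_accepted_spans counter",
  'otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 1280',
  'otelcol_receiver_refused_spans{receiver="otlp",transport="grpc"} 0',
].join("\n");

const acceptedSpans = sumMetric(sampleExposition, "otelcol_receiver_accepted_spans");
const collectorReceiving = acceptedSpans > 0;
```

Running the check twice under load and comparing the two sums confirms the counter is rising, not merely non-zero. Metric names can differ between Collector versions, so match the name your instance actually exposes.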

Edge Cases & Troubleshooting

Observability breakages fall into two buckets: telemetry that never leaves the backend, and telemetry that arrives but is not scraped, routed, or sampled correctly. The diagram splits the common symptoms across those two buckets so you know whether to look at the SDK or the Collector.

Deciding emit-side versus collect-side first cuts the search space in half before you read a single log line.

Symptom	Root Cause	Resolution
No spans reach the Collector	`instrumentation.ts` not imported first	Make `import './instrumentation'` the first line of `index.ts`
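Metrics present, traces missing	Collector `otlp/tempo` exporter endpoint wrong or TLS mismatch (the backend sends both signals to one endpoint, so that hop is working)	Verify `TEMPO_ENDPOINT` and the exporter's `tls` settings, then recheck `OTEL_EXPORTER_OTLP_ENDPOINT` and Collector OTLP gRPC port 4317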
Catalog metrics absent	Prometheus not scraping the Collector’s `:8889`	Add a scrape job targeting the Collector’s Prometheus exporter
Alerts fire but no page	Alertmanager route not matching `severity` label	Confirm the rule sets `severity: critical` and the route matcher matches
Excessive trace volume / cost	Auto-instrumentation tracing every request	Add tail-based sampling in the Collector to keep error and slow traces
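
The tail-based sampling suggested in the last row makes its decision per trace after all spans are in, which is what lets it keep the error and slow traces while dropping routine ones. The sketch below shows only that decision rule; `tracesToKeep`, the `SpanSummary` shape, and the 2000 ms threshold are assumptions for illustration. In the Collector itself this would be configured through a tail sampling processor, not written as application code.

```typescript
interface SpanSummary {
  traceId: string;
  durationMs: number;
  error: boolean;
}

// Keep a trace when any of its spans errored or ran at least `slowThresholdMs`.
function tracesToKeep(spans: SpanSummary[], slowThresholdMs: number): string[] {
  const keep: { [traceId: string]: boolean } = {};
  for (const span of spans) {
    const interesting = span.error || span.durationMs >= slowThresholdMs;
    keep[span.traceId] = keep[span.traceId] === true || interesting;
  }
  return Object.keys(keep).filter((traceId) => keep[traceId]);
}

// Three hypothetical traces: one routine, one with an error, one slow.
const kept = tracesToKeep(
  [
    { traceId: "t-routine", durationMs: 40, error: false },
    { traceId: "t-error", durationMs: 35, error: false },
    { traceId: "t-error", durationMs: 12, error: true },
    { traceId: "t-slow", durationMs: 7300, error: false },
    { traceId: "t-routine", durationMs: 55, error: false },
  ],
  2000,
);
```

Only `t-error` and `t-slow` survive. The cost is that the Collector has to hold spans in memory until a trace is complete, so size its memory accordingly.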

Frequently Asked Questions

Does OpenTelemetry replace Backstage’s built-in Prometheus metrics?

No — it complements them. Backstage already exposes catalog and HTTP metrics in Prometheus format. OpenTelemetry adds distributed traces and a unified OTLP export path, so you can correlate a slow catalog metric with the exact span that caused it.

Where should I run the Collector — sidecar or central?

For a single-cluster Backstage, a central Collector Deployment behind a service is simplest and cheapest. Move to a per-node DaemonSet only when trace volume or network locality demands it; the backend configuration is identical either way.