Syncing GitLab Repositories into the Backstage Catalog

Keeping the software catalog in sync with reality means the catalog should discover repositories automatically rather than depending on engineers to register each catalog-info.yaml by hand. This how-to configures the GitLab discovery entity provider so Backstage scans your GitLab groups, ingests catalog files, and removes entities when repos disappear. It applies the broader Catalog Integration Patterns to a GitLab source.

Automatic discovery matters because a stale catalog erodes trust: if half the services aren’t listed, engineers stop using the portal as a source of truth.

Prerequisites

  • Backstage 1.20+ with the new backend system and @backstage/plugin-catalog-backend-module-gitlab 0.3+.
  • A GitLab instance (SaaS gitlab.com or self-managed) reachable from the backend.
  • A GitLab personal or group access token with read_api and read_repository scopes, exported as ${GITLAB_TOKEN}.
  • Network egress from the backend to the GitLab API host.
  • catalog-info.yaml files at the root of the branch the provider reads (main in this guide) in each repository you want discovered.

Exact Configuration

1. Add the GitLab integration credentials

The integrations.gitlab block authenticates API calls. Use the env placeholder rather than a literal token.

# app-config.yaml
# Requires @backstage/integration >= 1.12.0
integrations:
  gitlab:
    - host: gitlab.com
      token: ${GITLAB_TOKEN}
    # For self-managed:
    # - host: gitlab.internal.corp
    #   apiBaseUrl: https://gitlab.internal.corp/api/v4
    #   baseUrl: https://gitlab.internal.corp
    #   token: ${GITLAB_SELF_MANAGED_TOKEN}

2. Configure the discovery provider

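The catalog.providers.gitlab block tells the provider which group to scan, which branch and file to read, which projects to include or skip, and how often to refresh. The key under gitlab (yourOrg here) is the provider ID that shows up in backend logs. The group setting narrows the scan; entityFilename is the catalog file to look for in each repo.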

# app-config.yaml
# Requires @backstage/plugin-catalog-backend-module-gitlab >= 0.3.0
catalog:
  providers:
    gitlab:
      yourOrg:
        host: gitlab.com
        group: platform-engineering          # scans this group and subgroups
        branch: main                          # branch to read catalog files from
        entityFilename: catalog-info.yaml
        projectPattern: '[\s\S]*'             # regex of projects to include
        excludeRepos: ['platform-engineering/archived-service']  # full project paths to skip
        schedule:
          frequency: { minutes: 30 }
          timeout: { minutes: 3 }
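
To reason about how group, projectPattern, and excludeRepos narrow a scan, the TypeScript sketch below mimics the selection on a list of project paths. It is an approximation for checking config values, not the module's actual implementation: selectProjects and the sample paths are hypothetical, and it assumes excludeRepos entries are compared against full project paths.

```typescript
// Hypothetical helper: approximates how a discovery scan narrows the
// project list. This is not the provider's real code.
interface DiscoveryFilter {
  group: string;          // e.g. 'platform-engineering'
  projectPattern: RegExp; // built from the projectPattern config string
  excludeRepos: string[]; // assumed to be full project paths
}

function selectProjects(paths: string[], filter: DiscoveryFilter): string[] {
  const prefix = `${filter.group}/`;
  return paths.filter(
    path =>
      path.startsWith(prefix) &&          // inside the group or a subgroup
      filter.projectPattern.test(path) && // matches the include pattern
      !filter.excludeRepos.includes(path), // not explicitly skipped
  );
}

const selected = selectProjects(
  [
    'platform-engineering/payments-api',
    'platform-engineering/tools/cli',
    'platform-engineering/archived-service',
    'marketing/website',
  ],
  {
    group: 'platform-engineering',
    projectPattern: /[\s\S]*/,
    excludeRepos: ['platform-engineering/archived-service'],
  },
);

console.log(selected);
// [ 'platform-engineering/payments-api', 'platform-engineering/tools/cli' ]
```

Running a list of real project paths through a helper like this before tightening projectPattern makes it easier to spot a regex that silently drops subgroup repos.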

3. Register the provider module in the backend

The module is a backend feature; add it alongside the catalog backend.

// packages/backend/src/index.ts
// Requires @backstage/plugin-catalog-backend-module-gitlab >= 0.3.0
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();
backend.add(import('@backstage/plugin-catalog-backend'));
// On newer module releases, import the package root instead (drop '/alpha').
backend.add(import('@backstage/plugin-catalog-backend-module-gitlab/alpha'));
backend.start();

The schedule block in config drives refresh automatically; no extra scheduler wiring is needed in the new backend system. If you need event-driven sync instead (push hooks rather than polling), add the GitLab events module and a webhook, but scheduled discovery remains the reliable baseline.

4. Ensure repositories expose a catalog file

Each discoverable repo needs a catalog-info.yaml at the configured branch root:

# catalog-info.yaml (in each GitLab repo)
# Compatible with Backstage 1.20+
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  annotations:
    gitlab.com/project-slug: platform-engineering/payments-api
spec:
  type: service
  lifecycle: production
  owner: team-payments

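The gitlab.com/project-slug annotation ties the entity to its GitLab project so that GitLab-aware plugins can look it up. The validation queries below also filter on it, so entities without the annotation will not appear in those results. Keep owner aligned with the groups your access model uses so discovery feeds correct ownership downstream.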
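
A quick pre-commit check can catch the most common catalog file mistakes before discovery ever sees them. The sketch below is a simplified, hypothetical validator (checkEntity is not a Backstage API). It assumes the file has already been parsed into an object and only covers the fields used in this guide; Backstage's own validation remains the source of truth.

```typescript
// Simplified, hypothetical check for the fields this guide relies on.
interface CatalogEntity {
  apiVersion?: string;
  kind?: string;
  metadata?: { name?: string; annotations?: Record<string, string> };
  spec?: { type?: string; lifecycle?: string; owner?: string };
}

// Assumed name rule: 1-63 characters, starting and ending with an
// alphanumeric, with '-', '_' or '.' allowed in between.
const NAME_PATTERN = /^[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?$/;

function checkEntity(entity: CatalogEntity): string[] {
  const problems: string[] = [];
  if (entity.apiVersion !== 'backstage.io/v1alpha1') {
    problems.push('apiVersion should be backstage.io/v1alpha1 (as used in this guide)');
  }
  if (!entity.kind) {
    problems.push('kind is missing');
  }
  const name = entity.metadata?.name ?? '';
  if (name.length === 0 || name.length > 63 || !NAME_PATTERN.test(name)) {
    problems.push('metadata.name is missing or malformed');
  }
  if (!entity.spec?.owner) {
    problems.push('spec.owner is missing');
  }
  const slug = entity.metadata?.annotations?.['gitlab.com/project-slug'];
  if (slug !== undefined && !slug.includes('/')) {
    problems.push('gitlab.com/project-slug should look like group/project');
  }
  return problems;
}

const good = checkEntity({
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: {
    name: 'payments-api',
    annotations: { 'gitlab.com/project-slug': 'platform-engineering/payments-api' },
  },
  spec: { type: 'service', lifecycle: 'production', owner: 'team-payments' },
});

const bad = checkEntity({
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: { name: 'payments api' }, // space is not allowed in a name
  spec: {},                            // owner missing
});

console.log(good); // []
console.log(bad);  // two problems: malformed name, missing owner
```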

Validation

# Requires the backend running with its output captured to a file, for example:
#   yarn start-backend 2>&1 | tee backend.log

# 1. Confirm the provider registered and scheduled
grep -i "gitlab" backend.log | grep -i "provider"
# expected: a log line registering GitlabDiscoveryEntityProvider for 'yourOrg'

# 2. Query the catalog for discovered entities
curl -s "http://localhost:7007/api/catalog/entities?filter=metadata.annotations.gitlab.com/project-slug" \
  -H "Authorization: Bearer ${BACKSTAGE_TOKEN}" | jq '.[].metadata.name'
# expected: names of repos that had a catalog-info.yaml

# 3. Inspect the Location entities created by the provider
curl -s "http://localhost:7007/api/catalog/entities?filter=kind=location,spec.type=url" \
  -H "Authorization: Bearer ${BACKSTAGE_TOKEN}" | jq '.[].spec.target' | grep gitlab
# expected: one GitLab URL per discovered catalog-info.yaml

Expected backend log on a successful scan (exact wording may differ between module versions):

catalog info Discovered 14 GitLab projects for group 'platform-engineering' provider=GitlabDiscoveryEntityProvider:yourOrg
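
The same catalog queries can be scripted, for example in a smoke test that runs after deployment. The following sketch assumes Node 18+ (for the global fetch) and the same base URL and bearer token as the curl commands; catalogEntitiesUrl, entityNames, and listDiscovered are hypothetical helper names, not part of any Backstage package.

```typescript
// Hypothetical helpers that mirror the curl checks above.
function catalogEntitiesUrl(baseUrl: string, filter: string): string {
  const url = new URL('/api/catalog/entities', baseUrl);
  url.searchParams.set('filter', filter); // value is percent-encoded for us
  return url.toString();
}

interface EntitySummary {
  metadata: { name: string };
}

function entityNames(entities: EntitySummary[]): string[] {
  return entities.map(entity => entity.metadata.name).sort();
}

// Fetches entities that carry the project-slug annotation and returns
// their names; throws on any non-2xx response.
async function listDiscovered(baseUrl: string, token: string): Promise<string[]> {
  const res = await fetch(
    catalogEntitiesUrl(baseUrl, 'metadata.annotations.gitlab.com/project-slug'),
    { headers: { Authorization: `Bearer ${token}` } },
  );
  if (!res.ok) {
    throw new Error(`Catalog request failed with status ${res.status}`);
  }
  return entityNames((await res.json()) as EntitySummary[]);
}

// Building the URL needs no network access:
console.log(catalogEntitiesUrl('http://localhost:7007', 'kind=location,spec.type=url'));
```

Calling listDiscovered('http://localhost:7007', token) against a running backend should return the same names as check 2.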

Edge Cases & Troubleshooting

| Symptom | Root Cause | Resolution |
|---|---|---|
| No entities discovered, no errors | Group mismatch or token lacks read_api | Verify the group path and that ${GITLAB_TOKEN} has read_api + read_repository |
| 401/403 in backend logs | Invalid or expired token | Rotate the token and confirm it is injected as ${GITLAB_TOKEN} |
| Self-managed instance not scanned | Missing apiBaseUrl/baseUrl | Add both under integrations.gitlab for the self-managed host |
| Entities never removed after repo deletion | Provider not running as the authoritative source | The discovery provider is full-mutation; ensure no manual static location duplicates the same entities |
| Subgroup repos missing | Provider not recursing | Confirm the token can read subgroups; group includes subgroups by default unless projectPattern excludes them |
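
When nothing is discovered, it helps to test the token and group path against GitLab directly, outside Backstage. The sketch below builds the GitLab REST request that lists a group's projects, subgroups included, and counts the first page of results. The helper names are hypothetical, and it assumes Node 18+ and a token sent in the PRIVATE-TOKEN header.

```typescript
// Hypothetical helper: builds the "list a group's projects" request.
function groupProjectsUrl(apiBaseUrl: string, group: string, page = 1): string {
  const id = encodeURIComponent(group); // 'a/b' becomes 'a%2Fb'
  const query = new URLSearchParams({
    include_subgroups: 'true',
    per_page: '100',
    page: String(page),
  });
  const base = apiBaseUrl.replace(/\/+$/, ''); // tolerate a trailing slash
  return `${base}/groups/${id}/projects?${query.toString()}`;
}

// Counts the projects visible to the token on the first page only
// (at most 100); enough to tell "nothing visible" from "something visible".
async function countVisibleProjects(
  apiBaseUrl: string,
  group: string,
  token: string,
): Promise<number> {
  const res = await fetch(groupProjectsUrl(apiBaseUrl, group), {
    headers: { 'PRIVATE-TOKEN': token },
  });
  if (!res.ok) {
    throw new Error(`GitLab returned status ${res.status}`);
  }
  const projects = (await res.json()) as unknown[];
  return projects.length;
}

console.log(groupProjectsUrl('https://gitlab.com/api/v4', 'platform-engineering'));
```

A 401 points at the token, a 404 usually means the group path is wrong or not visible to the token, and a count of 0 suggests the token cannot see any projects in the group.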

Frequently Asked Questions

Should I use the discovery provider or static catalog.locations?

Use the discovery provider for any group with more than a handful of repos. Static locations require a manual edit per repo and drift quickly. The provider treats GitLab as the authoritative source and reconciles additions and deletions on its schedule.

How do I avoid hitting GitLab API rate limits during scans?

Lengthen the schedule.frequency interval (30 minutes is a sane default), scope group as narrowly as possible, and use a group access token rather than a personal one so the rate budget isn’t tied to an individual. For very large groups, pair a longer interval with push events for near-real-time updates.
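
To get a feel for how the interval affects API load, a back-of-the-envelope estimate is often enough. The sketch below assumes a simple request model (one paginated project listing plus one catalog-file lookup per project on each scan). That model is an assumption for illustration, not a measurement of the provider, and requestsPerHour is a hypothetical helper.

```typescript
// Rough estimate only; the request model is an assumption.
interface HumanDuration {
  hours?: number;
  minutes?: number;
  seconds?: number;
}

function toSeconds(d: HumanDuration): number {
  return (d.hours ?? 0) * 3600 + (d.minutes ?? 0) * 60 + (d.seconds ?? 0);
}

function requestsPerHour(
  projectCount: number,
  frequency: HumanDuration,
  pageSize = 100,
): number {
  const interval = toSeconds(frequency);
  if (interval <= 0) {
    throw new Error('frequency must be a positive duration');
  }
  const listCalls = Math.ceil(projectCount / pageSize); // paginated project listing
  const fileChecks = projectCount;                      // assumed: one lookup per project
  const scansPerHour = 3600 / interval;
  return (listCalls + fileChecks) * scansPerHour;
}

console.log(requestsPerHour(400, { minutes: 30 })); // 808
console.log(requestsPerHour(400, { hours: 2 }));    // 202
```

Under these assumptions, moving a 400-project group from a 30-minute to a 2-hour interval cuts the estimated load to a quarter, which is the trade-off to weigh against how fresh the catalog needs to be.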