The Grafana LGTM Stack — Logs, Metrics, Traces, and Profiles in One Platform

Introduction

Observability means understanding system behavior through logs, metrics, traces, and profiles. Historically, these signals lived in separate silos: Prometheus for metrics, ELK for logs, Jaeger for traces, and custom profiling tools. The Grafana LGTM stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics, typically fed by Prometheus) converges these signals into a single platform. With correlation between logs and traces, dashboards that pull from all sources, and unified alerting, incident resolution accelerates dramatically. This post covers Prometheus metrics with recording rules, Loki log aggregation, Tempo distributed tracing, building dashboards as code, and cost optimization.

Prometheus Metrics and Recording Rules

Prometheus scrapes metrics from instrumented applications. Recording rules pre-compute expensive queries and reduce query load.

Prometheus configuration (prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: us-east-1
    environment: prod

rule_files:
- 'recording_rules.yml'
- 'alerting_rules.yml'

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)

- job_name: 'prometheus'
  static_configs:
  - targets:
    - localhost:9090

Recording rules (recording_rules.yml):

groups:
- name: api_server
  interval: 30s
  rules:
  - record: api:requests:rate1m
    expr: rate(api_requests_total[1m])

  - record: api:requests:rate5m
    expr: rate(api_requests_total[5m])

  - record: api:latency:p95
    expr: histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m]))

  - record: api:latency:p99
    expr: histogram_quantile(0.99, rate(api_request_duration_seconds_bucket[5m]))

  - record: api:error_rate:rate1m
    expr: rate(api_requests_total{status=~"5.."}[1m]) / rate(api_requests_total[1m])

- name: database
  interval: 30s
  rules:
  - record: db:connections:active
    expr: pg_stat_activity_count{state="active"}

  - record: db:connections:idle
    expr: pg_stat_activity_count{state="idle"}

  - record: db:replication:lag_bytes
    expr: pg_wal_lsn_lag_bytes

  - record: db:cache_hit_ratio
    expr: sum(rate(pg_heap_blks_hit[5m])) / (sum(rate(pg_heap_blks_hit[5m])) + sum(rate(pg_heap_blks_read[5m])))
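
Since recorded series are ordinary metrics, dashboards and scripts can query them by name through the Prometheus HTTP API instead of re-running the expensive expression. A minimal TypeScript sketch (the `prometheusQueryUrl` helper and the base URL are illustrative, not part of any official client):

```typescript
// Build a Prometheus HTTP API query URL for a pre-computed series.
// Hypothetical helper; base URL is an assumption for this sketch.
function prometheusQueryUrl(baseUrl: string, expr: string): string {
  return `${baseUrl}/api/v1/query?query=${encodeURIComponent(expr)}`;
}

// Reading api:latency:p99 is one cheap series lookup, versus running
// histogram_quantile over raw buckets on every dashboard refresh.
const url = prometheusQueryUrl("http://prometheus:9090", "api:latency:p99");
console.log(url);
```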

Alerting rules (alerting_rules.yml):

groups:
- name: api_alerts
  rules:
  - alert: HighErrorRate
    expr: api:error_rate:rate1m > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "Error rate {{ $value | humanizePercentage }} exceeds 5% threshold"

  - alert: HighLatency
    expr: api:latency:p99 > 1.0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency on {{ $labels.instance }}"
      description: "P99 latency {{ $value }}s exceeds 1s threshold"

  - alert: DatabaseReplicationLag
    expr: db:replication:lag_bytes > 1073741824  # 1GB
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Database replication lag exceeds 1GB"

Loki Log Aggregation with LogQL

Loki is a log aggregation system from Grafana Labs, designed to pair with Grafana. Unlike Elasticsearch, Loki indexes only the stream labels, not the log content, which makes it dramatically cheaper to run at scale.
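
To see why label-only indexing matters, here is a sketch of the payload shape Loki's push API accepts (`POST /loki/api/v1/push`): only the `stream` label set is indexed, while the log lines in `values` are merely stored and filtered at query time. The `buildPushPayload` helper is hypothetical:

```typescript
// Shape of a Loki push API payload. Labels are indexed; lines are not.
interface LokiStream {
  stream: Record<string, string>; // indexed labels -- keep these low-cardinality
  values: [string, string][];     // [unix-nanosecond timestamp, log line] -- stored, not indexed
}

// Hypothetical helper for this sketch; real agents (Promtail, the OTel
// Collector) batch and retry, which this does not.
function buildPushPayload(labels: Record<string, string>, lines: string[]): { streams: LokiStream[] } {
  const ts = `${Date.now()}000000`; // milliseconds -> nanoseconds
  return {
    streams: [{ stream: labels, values: lines.map((l): [string, string] => [ts, l]) }],
  };
}

const payload = buildPushPayload({ app: "api", namespace: "production" }, ['{"level":"error","msg":"boom"}']);
// await fetch("http://loki:3100/loki/api/v1/push", { method: "POST",
//   headers: { "Content-Type": "application/json" }, body: JSON.stringify(payload) });
```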

Promtail configuration (promtail-config.yaml):

clients:
- url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: kubernetes
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - production
      - staging
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_namespace_name]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_label_app]
    action: replace
    target_label: app
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: replace
    target_label: container

- job_name: syslog
  syslog:
    listen_address: 0.0.0.0:514
    labels:
      job: syslog
  relabel_configs:
  - source_labels: [__syslog_message_hostname]
    target_label: hostname

LogQL queries:

# Count logs per app
count_over_time({app="api"}[5m])

# Filter errors
{app="api"} |= "error" | json | status >= 500

# Parse JSON logs and extract fields
{app="api"} | json level="level", status="status" | status >= 500

# Calculate error rate
sum(rate({app="api"} |= "error" | json status="status" | status >= 500 [5m]))
/
sum(rate({app="api"} | json status="status" [5m]))

# Show logs with specific pattern
{app="worker"} |= "timeout" | regexp "request_id=(?P<request_id>[\\w-]+)" | request_id="abc123"

Tempo Distributed Tracing

Tempo captures end-to-end request flows. With Tempo in the LGTM stack, you can jump from a metric alert to logs to the trace of that request.

Tempo configuration (tempo.yaml):

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  max_block_duration: 5m

overrides:
  # per-tenant ingestion limits
  ingestion_rate_limit_bytes: 100000000
  ingestion_burst_size_bytes: 20000000

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
    blocklist_poll: 5m

Instrumentation (OpenTelemetry in TypeScript):

import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "api-server",
    [SemanticResourceAttributes.SERVICE_VERSION]: "1.2.3",
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: "production",
  }),
  traceExporter: new OTLPTraceExporter({
    url: "http://tempo:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
console.log("OpenTelemetry SDK started");

process.on("SIGTERM", () => {
  sdk.shutdown()
    .then(() => console.log("Tracing terminated"))
    .catch((err) => console.log("Error terminating tracing", err))
    .finally(() => process.exit(0));
});

Grafana Dashboards as Code (JSON)

Store dashboards in Git. Use Grafonnet (a Jsonnet library for generating dashboards) or raw JSON.

Dashboard JSON:

{
  "dashboard": {
    "title": "API Server",
    "tags": ["api", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(api_requests_total[5m])",
            "legendFormat": "{{ handler }} {{ method }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(api_requests_total{status=~\"5..\"}[5m]) / rate(api_requests_total[5m])",
            "legendFormat": "{{ handler }}"
          }
        ],
        "type": "graph",
        "alert": {
          "name": "HighErrorRate",
          "conditions": [
            {
              "operator": { "type": "gt" },
              "query": { "params": ["0.05"] },
              "reducer": { "params": [] },
              "type": "query"
            }
          ]
        }
      }
    ]
  }
}

Grafonnet (Jsonnet) example:

local grafana = import 'github.com/grafana/grafonnet-lib/grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local graph = grafana.graphPanel;
local prometheus = grafana.prometheus;

dashboard.new(
  'API Server Dashboard',
  tags=['api', 'production'],
  timezone='browser',
)
.addRow(
  row.new(title='Request Metrics')
  .addPanel(
    graph.new('Request Rate')
    .addTarget(prometheus.target('rate(api_requests_total[5m])')),
    gridPos={h: 8, w: 12, x: 0, y: 0},
  )
  .addPanel(
    graph.new('Error Rate')
    .addTarget(prometheus.target(
      'rate(api_requests_total{status=~"5.."}[5m]) / rate(api_requests_total[5m])',
      legendFormat='{{ handler }}',
    )),
    gridPos={h: 8, w: 12, x: 12, y: 0},
  )
)

Alert Routing with Grafana OnCall

Grafana OnCall routes alerts based on rules, escalates to on-call responders, and notifies via Slack, PagerDuty, or email.

OnCall integration:

  1. Create escalation policy
  2. Create on-call schedule
  3. Route alerts to on-call
  4. Notify via Slack/webhook

Example alert routing:

groups:
- name: critical_alerts
  rules:
  - alert: DatabaseDown
    expr: up{job="postgres"} == 0
    for: 2m
    labels:
      severity: critical
      team: database
    annotations:
      summary: "PostgreSQL is down"
      description: "PostgreSQL on {{ $labels.instance }} is unreachable"

Configure a notification channel in Grafana that sends to OnCall; OnCall then escalates based on the on-call schedule and acknowledgment status.

Correlation Between Logs, Metrics, and Traces

Grafana's unified querying correlates signals, letting you jump from metrics to logs to traces without leaving Grafana.

Example workflow:

  1. Dashboard shows error rate spike
  2. Click on spike → view logs for that time range
  3. See error messages in logs
  4. Click on a log entry → view the trace for that request
  5. Trace shows bottleneck: slow database query
  6. Jump to database metrics: high query latency
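
The jump from a log entry to its trace (step 4) only works if every structured log line carries the trace ID. A minimal sketch; the `trace_id` field name is an assumption and must match whatever Grafana's derived-fields/data-link configuration extracts:

```typescript
// Emit the active trace ID in every structured log line so Grafana can
// link logs to traces. Hypothetical helper for this sketch.
function logWithTrace(traceId: string, level: string, msg: string): string {
  return JSON.stringify({ ts: new Date().toISOString(), level, msg, trace_id: traceId });
}

// With OpenTelemetry, the live trace ID comes from:
//   trace.getActiveSpan()?.spanContext().traceId   (@opentelemetry/api)
console.log(logWithTrace("4bf92f3577b34da6a3ce929d0e0e4736", "error", "payment failed"));
```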

Trace linking in dashboard:

{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "targetBlank": true,
          "title": "View Trace",
          "url": "d/trace-detail?var-trace_id=${__data.fields.trace_id}"
        }
      ]
    }
  }
}

Cost Optimization

Observability at scale is expensive. Optimize carefully.

Metrics retention:

  • High resolution (15s scrape interval): 30 days
  • Lower resolution (1m): 1 year
  • Aggregates (hourly): 5 years

# Prometheus retention is set via command-line flags, not prometheus.yml
--storage.tsdb.retention.time=30d    # high-resolution window
--storage.tsdb.retention.size=100GB  # whichever limit is hit first wins
# Longer, downsampled tiers (1m / hourly) require Mimir or Thanos;
# Prometheus itself does not downsample.

Log sampling:

  • ERROR logs: 100% (keep all)
  • WARN logs: 100%
  • INFO logs: 10% (sample 90%)
  • DEBUG logs: 1% (sample 99%)

# Promtail pipeline sketch: Loki itself has no per-level sampling config,
# so drop low-value lines at the agent. The `sampling` stage rate is the
# keep ratio.
pipeline_stages:
- match:
    selector: '{app="api"} |~ `"level":"info"`'
    stages:
    - sampling:
        rate: 0.10   # keep 10% of INFO lines
- match:
    selector: '{app="api"} |~ `"level":"debug"`'
    stages:
    - sampling:
        rate: 0.01   # keep 1% of DEBUG lines
# ERROR and WARN lines fall through unsampled (100% kept)

Trace sampling:

  • Errors: 100%
  • Long requests (>1s): 50%
  • Normal requests: 5%

// Head sampling: keep 5% of traces (pass as `sampler` to the NodeSDK).
// Error- or latency-based rules need tail sampling, e.g. the OpenTelemetry
// Collector's tail_sampling processor.
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sampler = new TraceIdRatioBasedSampler(0.05);

Cardinality control: Avoid high-cardinality labels. Bad:

api_requests_total{user_id="123456"}

Good:

api_requests_total{service="api", handler="orders"}
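
Enforcing this in application code is cheap: validate label sets before they reach the metrics client. A sketch with an assumed allow-list and heuristics for unbounded-looking values:

```typescript
// Guard against high-cardinality labels. The allow-list and the
// "looks like an ID" heuristics are assumptions for illustration.
const ALLOWED_LABELS = new Set(["service", "handler", "method", "status"]);

function validateLabels(labels: Record<string, string>): string[] {
  const problems: string[] = [];
  for (const [key, value] of Object.entries(labels)) {
    if (!ALLOWED_LABELS.has(key)) problems.push(`unknown label: ${key}`);
    // Flag numeric IDs and UUID-like values, which explode series counts.
    if (/^\d{4,}$/.test(value) || /^[0-9a-f-]{32,}$/i.test(value)) {
      problems.push(`unbounded-looking value for ${key}: ${value}`);
    }
  }
  return problems;
}

console.log(validateLabels({ user_id: "123456" }));                 // flags both issues
console.log(validateLabels({ service: "api", handler: "orders" })); // no problems
```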

Checklist

  • Prometheus configured with appropriate scrape intervals per job
  • Recording rules pre-compute expensive queries (p95, p99, error rates)
  • Alerting rules define critical thresholds with runbook URLs
  • Loki configured with appropriate label selectors (not log content)
  • Promtail scrapes all relevant log sources and adds labels
  • Tempo ingests traces from all services via OpenTelemetry
  • Dashboards stored as JSON in Git; updated via CI/CD
  • Grafana alerts routed through OnCall with escalation policies
  • Links between metrics, logs, and traces configured
  • Retention policies set per signal type (30d metrics, 7d logs, 3d traces)
  • Sampling rules optimize cost without losing critical signals
  • Cardinality monitoring enabled; high-cardinality labels blocked
  • Runbooks linked from all production alerts

Conclusion

The Grafana LGTM stack converges observability signals, enabling faster incident resolution. Prometheus with recording rules pre-computes complex aggregations. Loki provides affordable log aggregation. Tempo captures distributed traces. Grafana correlates these signals and powers alerting. Store dashboards in Git, automate their deployment, and leverage unified querying to jump from metrics to logs to traces. With proper retention policies, sampling strategies, and cardinality controls, you can observe production systems at massive scale without cost spiraling out of control.