No Observability Strategy — Flying Blind in Production

Introduction

Observability is the ability to understand what your system is doing from its outputs — without needing to SSH in and read raw logs. The three pillars are metrics (what happened), logs (why it happened), and traces (where it happened across services). A system with no observability strategy is a system you can't operate confidently.

The Three Pillars

Metrics:    "Error rate spiked to 15% at 14:32"
What is happening (aggregated, time-series)
Tools: Prometheus, DataDog, CloudWatch

Logs:       "TypeError: Cannot read property 'id' of undefined
             at OrderService.processOrder (orders.service.ts:47)"
Why it happened (detailed, per-event)
Tools: Elasticsearch, Loki, CloudWatch Logs

Traces:     "Request took 3.2s: user-service 50ms, DB 2.8s, cache miss"
Where it happened across services (distributed context)
Tools: Jaeger, Zipkin, DataDog APM, OpenTelemetry

Fix 1: Structured Logging from Day One

// ❌ Unstructured logs — hard to query, hard to correlate
console.log('Processing order ' + orderId + ' for user ' + userId)
console.error('Error: ' + err.message)

// ✅ Structured logs — machine-parseable, correlatable
import pino from 'pino'

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: { service: 'order-service', version: process.env.APP_VERSION },
})

// Every log has consistent structure
logger.info({ orderId, userId, total }, 'Processing order')
logger.error({ orderId, error: err.message, stack: err.stack }, 'Order processing failed')

// Output:
// {"level":"info","time":"2026-03-15T14:32:00Z","service":"order-service",
//  "orderId":"ord_123","userId":"u456","total":1999,"msg":"Processing order"}

Fix 2: Request Tracing Across Services

import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'

// Initialize OpenTelemetry once at startup
const provider = new NodeTracerProvider()
provider.register()

const tracer = trace.getTracer('order-service')

// Instrument your handlers
app.post('/orders', async (req, res) => {
  const span = tracer.startSpan('create-order')

  try {
    await context.with(trace.setSpan(context.active(), span), async () => {
      const order = await ordersService.createOrder(req.body)
      span.setAttributes({ orderId: order.id, total: order.total })
      res.json(order)
    })
  } catch (err) {
    span.recordException(err as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw err
  } finally {
    span.end()
  }
})

// When calling downstream services, propagate trace context
async function callUserService(userId: string) {
  const headers: Record<string, string> = {}
  propagation.inject(context.active(), headers)
  // headers now includes traceparent: "00-{traceId}-{spanId}-01"
  return fetch(`${USER_SERVICE_URL}/users/${userId}`, { headers })
}

Fix 3: The Four Golden Signals

// Monitor these four metrics for every service:
// 1. Latency — how long requests take
// 2. Traffic — how many requests per second
// 3. Errors — error rate
// 4. Saturation — how "full" the service is (CPU, memory, queue depth)

import { Counter, Histogram, register } from 'prom-client'

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
})

const httpErrors = new Counter({
  name: 'http_errors_total',
  help: 'Total HTTP errors',
  labelNames: ['method', 'route', 'status'],
})

// Express middleware
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    const labels = { method: req.method, route: req.route?.path ?? req.path, status: res.statusCode }
    end(labels)
    if (res.statusCode >= 400) httpErrors.inc(labels)
  })
  next()
})

// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})

Fix 4: Alerting on SLOs, Not Just Thresholds

# Prometheus alerting rules
groups:
  - name: slo-alerts
    rules:
      # Alert if error rate > 1% over 5 minutes
      - alert: HighErrorRate
        expr: |
          sum by (route) (rate(http_errors_total[5m]))
            / sum by (route) (rate(http_request_duration_seconds_count[5m])) > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.route }}"

      # Alert if p99 latency > 2s
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
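The two rules above are still raw thresholds. A closer fit to this section's title is a burn-rate alert: page when the service is consuming its error budget faster than the SLO allows. A sketch for a 99.9% availability SLO, using the standard multi-window pattern from the Google SRE Workbook (the 14.4 factor is its fast-burn multiplier; the short window keeps the alert from firing on stale data):

```yaml
      # Fast burn: at 14.4x burn rate, a 30-day 99.9% error budget
      # is exhausted in about two days — page immediately.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_errors_total[1h]))
              / sum(rate(http_request_duration_seconds_count[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_errors_total[5m]))
              / sum(rate(http_request_duration_seconds_count[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
```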

Observability Checklist

  • ✅ All logs are structured JSON with service name and correlation IDs
  • ✅ Distributed tracing is in place with W3C traceparent headers
  • ✅ Four golden signals (latency, traffic, errors, saturation) are monitored
  • ✅ Dashboards exist for every service before it goes to production
  • ✅ Alerts fire on SLO violations, not just raw thresholds
  • ✅ Any engineer can diagnose a production issue without SSH access
  • ✅ Trace IDs are included in error responses to users (for support correlation)

Conclusion

Observability isn't a feature you add after launch — it's the foundation that makes production operations possible. Structured logging, distributed tracing, and the four golden signals should be in place before your first production deploy. The rule is simple: if you can't answer "what is this service doing right now, and why is it slow?" from dashboards and traces alone, you're flying blind. Add OpenTelemetry instrumentation, ship logs to a queryable backend, and set up Prometheus metrics for every service. The investment pays back in the first incident.