No Observability Strategy — Flying Blind in Production
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Observability is the ability to understand what your system is doing from its outputs — without needing to SSH in and read raw logs. The three pillars are metrics (what happened), logs (why it happened), and traces (where it happened across services). A system with no observability strategy is a system you can't operate confidently.
- The Three Pillars
- Fix 1: Structured Logging from Day One
- Fix 2: Request Tracing Across Services
- Fix 3: The Four Golden Signals
- Fix 4: Alerting on SLOs, Not Just Thresholds
- Observability Checklist
- Conclusion
The Three Pillars
Metrics: "Error rate spiked to 15% at 14:32"
→ What is happening (aggregated, time-series)
→ Tools: Prometheus, DataDog, CloudWatch
Logs: "TypeError: Cannot read property 'id' of undefined
at OrderService.processOrder (orders.service.ts:47)"
→ Why it happened (detailed, per-event)
→ Tools: Elasticsearch, Loki, CloudWatch Logs
Traces: "Request took 3.2s: user-service 50ms, DB 2.8s, cache miss"
→ Where it happened across services (distributed context)
→ Tools: Jaeger, Zipkin, DataDog APM, OpenTelemetry
Fix 1: Structured Logging from Day One
// ❌ Unstructured logs — hard to query, hard to correlate
console.log('Processing order ' + orderId + ' for user ' + userId)
console.error('Error: ' + err.message)
// ✅ Structured logs — machine-parseable, correlatable
import pino from 'pino'

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: { service: 'order-service', version: process.env.APP_VERSION },
  // Emit ISO timestamps and string levels so the output matches the example
  // below (pino's defaults are epoch-ms times and numeric levels)
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: { level: (label) => ({ level: label }) },
})

// Every log has consistent structure
logger.info({ orderId, userId, total }, 'Processing order')
logger.error({ orderId, error: err.message, stack: err.stack }, 'Order processing failed')

// Output:
// {"level":"info","time":"2026-03-15T14:32:00.000Z","service":"order-service",
//  "orderId":"ord_123","userId":"u456","total":1999,"msg":"Processing order"}
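Structured fields only pay off once every log line from the same request carries the same correlation ID, so you can filter one request's entire history in your log backend. Here is a stdlib-only sketch of that pattern using Node's `AsyncLocalStorage` (the `requestId` field name and the helper names are illustrative; with pino you would achieve the same thing via `logger.child({ requestId })`):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'
import { randomUUID } from 'node:crypto'

// Holds the correlation ID for the lifetime of one request
const requestContext = new AsyncLocalStorage<{ requestId: string }>()

// Minimal structured logger that stamps every line with the active requestId
function log(fields: Record<string, unknown>, msg: string): string {
  const requestId = requestContext.getStore()?.requestId
  const line = JSON.stringify({ level: 'info', requestId, ...fields, msg })
  console.log(line)
  return line
}

// Wrap each incoming request in a context, e.g. from Express middleware
function withRequest<T>(fn: () => T): T {
  return requestContext.run({ requestId: randomUUID() }, fn)
}
```

In real middleware you would call `withRequest` around `next()`, so every log statement deeper in the call stack picks up the ID without threading it through function arguments.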
Fix 2: Request Tracing Across Services
import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'

// Initialize OpenTelemetry once at startup
const provider = new NodeTracerProvider()
provider.register()
const tracer = trace.getTracer('order-service')

// Instrument your handlers
app.post('/orders', async (req, res) => {
  const span = tracer.startSpan('create-order')
  try {
    await context.with(trace.setSpan(context.active(), span), async () => {
      const order = await ordersService.createOrder(req.body)
      span.setAttributes({ orderId: order.id, total: order.total })
      res.json(order)
    })
  } catch (err) {
    span.recordException(err as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw err
  } finally {
    span.end()
  }
})

// When calling downstream services, propagate trace context
async function callUserService(userId: string) {
  const headers: Record<string, string> = {}
  propagation.inject(context.active(), headers)
  // headers now includes traceparent: "00-{traceId}-{spanId}-01"
  return fetch(`${USER_SERVICE_URL}/users/${userId}`, { headers })
}
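The `inject` side above has a mirror image on the receiving service: parse the incoming `traceparent` header and continue under the same trace ID. In a real service `propagation.extract` does this for you; the stdlib-only sketch below just shows the W3C Trace Context mechanics (the helper names are made up for illustration):

```typescript
import { randomBytes } from 'node:crypto'

interface TraceContext {
  traceId: string // 32 hex chars, shared by every span in the request
  spanId: string  // 16 hex chars, unique per span
}

// Parse a W3C traceparent header: "00-{traceId}-{spanId}-{flags}"
function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(header)
  return m ? { traceId: m[1], spanId: m[2] } : null
}

// Build the header for the next downstream call: same traceId, fresh spanId
function childTraceparent(parent: TraceContext): string {
  const spanId = randomBytes(8).toString('hex')
  return `00-${parent.traceId}-${spanId}-01`
}
```

Keeping the trace ID stable while minting a new span ID per hop is exactly what lets Jaeger or DataDog stitch the whole request back together.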
Fix 3: The Four Golden Signals
// Monitor these four metrics for every service:
// 1. Latency — how long requests take
// 2. Traffic — how many requests per second
// 3. Errors — error rate
// 4. Saturation — how "full" the service is (CPU, memory, queue depth)
import { Counter, Histogram, register } from 'prom-client'

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
})

const httpErrors = new Counter({
  name: 'http_errors_total',
  help: 'Total HTTP errors',
  labelNames: ['method', 'route', 'status'],
})

// Express middleware
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    const labels = { method: req.method, route: req.route?.path ?? req.path, status: res.statusCode }
    end(labels)
    if (res.statusCode >= 400) httpErrors.inc(labels)
  })
  next()
})

// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
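The middleware above covers latency, traffic, and errors; saturation usually needs its own gauge, tracking whatever resource fills up first (queue depth, connection pool usage, event loop lag). Here is a hand-rolled sketch of a queue-depth gauge rendered in the Prometheus text exposition format, kept dependency-free so it runs standalone (with prom-client you would just use its `Gauge` class; the metric name is an assumption):

```typescript
// Tracks how "full" the service is — here, jobs waiting in an in-process queue
class QueueDepthGauge {
  private depth = 0
  constructor(private readonly name: string, private readonly help: string) {}

  enqueue(): void { this.depth += 1 }
  dequeue(): void { this.depth = Math.max(0, this.depth - 1) }

  // Render in the Prometheus text exposition format, as /metrics would
  expose(): string {
    return [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} gauge`,
      `${this.name} ${this.depth}`,
    ].join('\n')
  }
}
```

With prom-client, a `Gauge` with a `collect` callback that reads the live queue length gives you the same signal without the boilerplate.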
Fix 4: Alerting on SLOs, Not Just Thresholds
# Prometheus alerting rules
groups:
  - name: slo-alerts
    rules:
      # Alert if error rate > 1% over 5 minutes.
      # Aggregate over method/status so the ratio is errors / all requests
      # per route (a raw per-series division would pair each error status
      # only with requests of that same status).
      - alert: HighErrorRate
        expr: |
          sum by (route) (rate(http_errors_total[5m]))
            / sum by (route) (rate(http_request_duration_seconds_count[5m])) > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.route }}"
      # Alert if p99 latency > 2s
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
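The step beyond fixed thresholds is error-budget burn-rate alerting: page only when the current error rate would exhaust the SLO's failure budget too fast. A sketch of the arithmetic (the 14.4x multiplier is the commonly cited fast-burn threshold from Google's SRE Workbook for a 1h/5m window pair; treat the exact constants as tunable assumptions):

```typescript
// Error-budget burn rate: how fast the SLO's allowed failure budget is
// being consumed. 1x means "exactly on budget"; >1x means burning too fast.
function burnRate(errorRate: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget // e.g. 99.9% SLO → 0.1% budget
  return errorRate / errorBudget
}

// Multi-window rule: page only when BOTH a short and a long window burn
// fast — the long window filters blips, the short window confirms it's live
function shouldPage(shortWindowRate: number, longWindowRate: number, sloTarget: number): boolean {
  return burnRate(shortWindowRate, sloTarget) > 14.4 && burnRate(longWindowRate, sloTarget) > 14.4
}
```

In Prometheus terms, `shortWindowRate` and `longWindowRate` would each be the error-ratio expression from the rule above, evaluated over windows like `[5m]` and `[1h]`.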
Observability Checklist
- ✅ All logs are structured JSON with service name and correlation IDs
- ✅ Distributed tracing is in place with W3C traceparent headers
- ✅ Four golden signals (latency, traffic, errors, saturation) are monitored
- ✅ Dashboards exist for every service before it goes to production
- ✅ Alerts fire on SLO violations, not just raw thresholds
- ✅ Any engineer can diagnose a production issue without SSH access
- ✅ Trace IDs are included in error responses to users (for support correlation)
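The last checklist item is cheap to implement: when a request fails, echo the trace ID back in the error body so support can paste it straight into the tracing UI. A framework-agnostic sketch (the response shape and `traceId` field name are assumptions; with OpenTelemetry you would typically read the ID from the active span's context rather than re-parsing the header):

```typescript
interface ErrorBody {
  error: string
  traceId?: string // lets support correlate the report with the exact trace
}

// Build a safe error payload, pulling the trace ID out of the incoming
// W3C traceparent header ("00-{traceId}-{spanId}-{flags}") when present
function errorResponse(message: string, traceparent?: string): ErrorBody {
  const traceId = traceparent?.match(/^00-([0-9a-f]{32})-/)?.[1]
  return traceId ? { error: message, traceId } : { error: message }
}
```

An Express error handler would call this with `req.headers['traceparent']` and return it as the JSON body alongside the 5xx status.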
Conclusion
Observability isn't a feature you add after launch — it's the foundation that makes production operations possible. Structured logging, distributed tracing, and the four golden signals should be in place before your first production deploy. The rule is simple: if you can't answer "what is this service doing right now, and why is it slow?" from dashboards and traces alone, you're flying blind. Add OpenTelemetry instrumentation, ship logs to a queryable backend, and set up Prometheus metrics for every service. The investment pays back in the first incident.