Observability Explained: Metrics, Logs, and Traces

Observability is the ability to understand what is happening inside a running system from the data it emits on the outside — its metrics, logs, and traces. When a request is slow, an error spikes, or a deploy quietly breaks a feature nobody watches, observability is what lets you ask a brand-new question of your production system and get an answer without shipping new code to find out. This guide explains the three signals that make a system observable, how observability differs from plain monitoring, the architecture that collects and stores the data, and how OpenTelemetry has become the vendor-neutral standard that ties it all together.

This is the pillar page for the Devgains observability cluster. It sits underneath the Kubernetes architecture guide — once you run more than one service, observability stops being optional — and pairs naturally with liveness and readiness probes, which are the simplest health signal a system emits.

Quick answer: what is observability?

Observability is a property of a system: how well you can infer its internal state from its external outputs. In practice it rests on three pillars — three kinds of telemetry that each answer a different question:

Metrics — cheap, aggregatable numbers over time. Is something wrong, and how much? (request rate, error rate, latency percentiles, CPU, memory).
Logs — timestamped records of discrete events. What exactly happened at this moment?
Traces — the end-to-end path of a single request across services. Where, in a chain of calls, did the time or the error go?

The one-line rule: metrics tell you that something is wrong, traces tell you where, and logs tell you why. You need all three because each is blind to what the others see clearly.
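To make the three signals concrete, here is a minimal sketch of what each one might look like as data, and how a shared trace id ties them together. The interfaces (`MetricPoint`, `LogLine`, `SpanRecord`) and their field names are hypothetical illustrations, not the OpenTelemetry data model.

```typescript
// Hypothetical shapes for the three signals; field names are illustrative only.
interface MetricPoint {
  metric: string;                  // e.g. "http_requests_total"
  labels: Record<string, string>;  // low-cardinality dimensions only
  value: number;
  timestampMs: number;
}

interface LogLine {
  timestampMs: number;
  level: "info" | "warn" | "error";
  message: string;
  traceId?: string;                // links the event back to one request
}

interface SpanRecord {
  traceId: string;                 // shared by every span in one request
  spanId: string;
  parentSpanId?: string;           // absent on the root span
  service: string;
  operation: string;
  durationMs: number;
}

// Metric: tells you THAT something is wrong (errors are being counted).
const errorCount: MetricPoint = {
  metric: "http_requests_total",
  labels: { route: "/checkout", status: "500" },
  value: 42,
  timestampMs: 1700000000000,
};

// Trace: tells you WHERE the time went across services.
const spans: SpanRecord[] = [
  { traceId: "t1", spanId: "a", service: "gateway", operation: "POST /checkout", durationMs: 480 },
  { traceId: "t1", spanId: "b", parentSpanId: "a", service: "payments", operation: "charge", durationMs: 430 },
];

// Logs: tell you WHY, tied to the same request through traceId.
const logs: LogLine[] = [
  { timestampMs: 1700000000100, level: "error", message: "connection pool exhausted", traceId: "t1" },
  { timestampMs: 1700000000200, level: "info", message: "cache warmed" },
];

// Slowest non-root span of a trace: the first place to look.
function slowestChild(all: SpanRecord[], traceId: string): SpanRecord | undefined {
  return all
    .filter((s) => s.traceId === traceId && s.parentSpanId !== undefined)
    .sort((x, y) => y.durationMs - x.durationMs)[0];
}

// Every log line emitted while handling that trace.
function logsForTrace(all: LogLine[], traceId: string): LogLine[] {
  return all.filter((l) => l.traceId === traceId);
}

const suspect = slowestChild(spans, "t1");
const evidence = logsForTrace(logs, "t1");
console.log(errorCount.labels.route, suspect?.service, evidence.map((l) => l.message));
```

The metric flags the failing route, the trace narrows it to the `payments` span, and the correlated log line supplies the reason. Real backends do this join with queries rather than in-process filters.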

Observability vs monitoring: what's the difference?

The terms get used interchangeably, but the distinction is real and useful.

Monitoring answers known questions. You decide in advance what matters ("alert me when error rate > 1%"), build a dashboard, and watch it. Monitoring is about known unknowns.
Observability lets you answer unknown questions after the fact. When a failure you never predicted happens, rich telemetry lets you slice, filter, and correlate your way to the cause — no redeploy required. Observability is about unknown unknowns.

Monitoring is a subset of what a well-instrumented, observable system gives you. You still build dashboards and alerts; observability just means the underlying data is high-cardinality and correlated enough to answer questions you didn't think to ask.
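The distinction can be sketched in a few lines. In this illustration, monitoring is a fixed threshold check, while observability is the ability to group the same events by any field after the fact. The `RequestEvent` shape and the sample data are assumptions made up for the example.

```typescript
// Hypothetical "wide" request events; field names and values are illustrative.
interface RequestEvent {
  route: string;
  region: string;
  appVersion: string;
  failed: boolean;
}

const events: RequestEvent[] = [
  { route: "/checkout", region: "eu", appVersion: "1.4.0", failed: false },
  { route: "/checkout", region: "eu", appVersion: "1.4.1", failed: true },
  { route: "/checkout", region: "us", appVersion: "1.4.1", failed: true },
  { route: "/search", region: "us", appVersion: "1.4.0", failed: false },
  { route: "/search", region: "eu", appVersion: "1.4.0", failed: false },
];

// Monitoring: a question fixed in advance ("is the error rate above 1%?").
function errorRate(evts: RequestEvent[]): number {
  if (evts.length === 0) return 0;
  return evts.filter((e) => e.failed).length / evts.length;
}
const alertFiring = errorRate(events) > 0.01;

// Observability: break the same data down by ANY field, chosen after the failure.
function errorRateBy(evts: RequestEvent[], field: keyof RequestEvent): Record<string, number> {
  const buckets = new Map<string, { total: number; failed: number }>();
  for (const e of evts) {
    const key = String(e[field]);
    const bucket = buckets.get(key) ?? { total: 0, failed: 0 };
    bucket.total += 1;
    if (e.failed) bucket.failed += 1;
    buckets.set(key, bucket);
  }
  const rates: Record<string, number> = {};
  buckets.forEach((b, key) => {
    rates[key] = b.failed / b.total;
  });
  return rates;
}

const byVersion = errorRateBy(events, "appVersion"); // { "1.4.0": 0, "1.4.1": 1 }
const byRegion = errorRateBy(events, "region");
console.log(alertFiring, byVersion, byRegion);
```

The alert only says the error rate is high. Grouping by `appVersion` shows that every failure comes from version 1.4.1, a question nobody had to anticipate when the events were emitted.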

Why observability matters

Modern systems are distributed, and distributed systems fail in ways a single stack trace can't explain. A user-facing request might touch a load balancer, three microservices, a cache, and a database before returning. When it's slow, the log line in service A says nothing about the 400ms that service C spent waiting on a saturated connection pool. Observability exists to make that invisible chain visible.

The payoff is concrete:

Faster incident resolution. MTTR (mean time to resolution) drops when you can jump from "error rate up" to "this endpoint, this dependency, this line" in minutes.
Safer deploys. Watch the golden signals (latency, traffic, errors, saturation) during a rolling update and roll back on real evidence, not a hunch.
Capacity and cost. Metrics over time tell you what to scale and what you're overpaying for.

Architecture: how observability data flows

A production observability pipeline has four stages, and metrics, logs, and traces all pass through the same four:

Instrumentation — your code (or an auto-instrumentation agent) emits telemetry. This is where OpenTelemetry SDKs live.
Collection — an agent or the OpenTelemetry Collector receives, batches, filters, and enriches the data, then fans it out to backends.
Storage — time-series databases for metrics (Prometheus, Mimir), log stores (Loki, Elasticsearch), and trace stores (Tempo, Jaeger).
Query and visualization — dashboards, alerts, and trace explorers (Grafana is the common front end).

The key architectural idea of the last few years: decouple instrumentation from backends. You instrument once with OpenTelemetry and can swap Prometheus for a hosted vendor, or Jaeger for Tempo, without touching application code. That is the whole point of the standard.
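The decoupling idea can be illustrated with a small interface. This is a sketch of the pattern only: `SpanData`, `SpanSink`, `InMemorySink`, `ConsoleSink`, and `timed` are hypothetical names, not OpenTelemetry APIs.

```typescript
// Hypothetical names throughout; this shows the pattern, not the OpenTelemetry API.
interface SpanData {
  traceId: string;
  operation: string;
  durationMs: number;
}

// The only contract the instrumented code depends on.
interface SpanSink {
  export(batch: SpanData[]): void;
}

// One possible backend: keep spans in memory (useful in tests).
class InMemorySink implements SpanSink {
  readonly received: SpanData[] = [];
  export(batch: SpanData[]): void {
    this.received.push(...batch);
  }
}

// Another possible backend: print spans to stdout.
class ConsoleSink implements SpanSink {
  export(batch: SpanData[]): void {
    for (const s of batch) console.log(`${s.traceId} ${s.operation} ${s.durationMs}ms`);
  }
}

// Instrumentation: written once, unaware of which backend is configured.
function timed<T>(sink: SpanSink, traceId: string, operation: string, work: () => T): T {
  const start = Date.now();
  try {
    return work();
  } finally {
    sink.export([{ traceId, operation, durationMs: Date.now() - start }]);
  }
}

// Swapping the backend is a decision made at the edge of the program.
const memorySink = new InMemorySink();
const total = timed(memorySink, "t1", "sum", () => 1 + 2);
timed(new ConsoleSink(), "t2", "noop", () => undefined);
```

The `timed` helper never changes when the sink does. OpenTelemetry applies the same separation at a larger scale, with the SDK on one side and exporters or the Collector on the other.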

The three pillars compared

Signal	Answers	Cost	Cardinality	Typical tool
Metrics	Is it broken? How much?	Very low (aggregated)	Low–medium	Prometheus, Grafana
Logs	What happened here?	Medium–high (per event)	High	Loki, Elasticsearch
Traces	Where did the time/error go?	Medium (sampled)	Very high	Jaeger, Tempo
Profiles (emerging 4th)	Which code path burned CPU/RAM?	Medium	High	Pyroscope

Cardinality is the number of unique label combinations, and a metrics store keeps each combination as its own time series. That is why you put http.status_code on a metric but not user_id: high-cardinality labels explode metric storage, and per-user or per-request detail belongs in logs and traces instead.
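A quick worked calculation shows why one unbounded label matters so much. The label counts below are made-up assumptions for illustration, and `seriesCount` is a hypothetical helper that computes the worst case: the product of the distinct values of every label.

```typescript
// Worst-case number of time series for ONE metric:
// the product of the distinct values each label can take.
interface LabelSpec {
  label: string;
  distinctValues: number; // assumed counts, for illustration only
}

function seriesCount(labels: LabelSpec[]): number {
  return labels.reduce((product, l) => product * l.distinctValues, 1);
}

const boundedLabels: LabelSpec[] = [
  { label: "http.status_code", distinctValues: 5 },
  { label: "route", distinctValues: 20 },
  { label: "method", distinctValues: 4 },
];

// 5 * 20 * 4 = 400 series: easy for any metrics store.
const bounded = seriesCount(boundedLabels);

// Add a single unbounded label and the worst case multiplies by the user count.
const withUserId = seriesCount([
  ...boundedLabels,
  { label: "user_id", distinctValues: 100000 },
]);

console.log(bounded, withUserId); // 400 40000000
```

In practice not every combination occurs, so the real number is lower than the product, but the growth is still multiplicative with each label added.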

Step-by-step: instrument a service with OpenTelemetry

Here's the minimal path to emitting real telemetry. First, add auto-instrumentation to a Node.js service — this captures HTTP, database, and framework spans without changing business logic:

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http

// tracing.ts: compile to tracing.js, then preload it BEFORE your app code (node -r ./tracing.js app.js)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
 
const sdk = new NodeSDK({
  // Send spans to a local OpenTelemetry Collector over OTLP/HTTP.
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
 
sdk.start(); // Every incoming request now produces a trace automatically.

This alone gives you distributed traces: each request becomes a trace made of spans, and the trace context propagates across service boundaries via HTTP headers, so a call from service A to service B stitches into one timeline.
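The header that carries this context is typically the W3C Trace Context `traceparent` header, which packs a version, a trace id, the caller's span id, and a flags byte into one string. The sketch below parses and re-emits it by hand to show what the instrumentation does for you. The function names are hypothetical, and it only accepts version `00`, which is a simplification.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars, shared by the whole trace
  parentId: string; // 16 lowercase hex chars, the caller's span id
  sampled: boolean; // lowest bit of the trace-flags byte
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null for anything that is not a valid header.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, traceId, parentId, flagsHex] = match;
  // All-zero ids are defined as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flagsHex, 16) & 1) === 1 };
}

// Hypothetical helper: the header a service sends on its own outgoing call.
// The trace id is kept; the parent id becomes this service's span id.
function childTraceparent(ctx: TraceContext, ownSpanId: string): string {
  return `00-${ctx.traceId}-${ownSpanId}-${ctx.sampled ? "01" : "00"}`;
}

// Service B receives this header from service A...
const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");

// ...and forwards the same trace id, with its own span id, to service C.
const outgoing = incoming === null ? null : childTraceparent(incoming, "b7ad6b7169203331");
console.log(outgoing);
```

Because the trace id survives every hop, spans recorded by different services can be assembled into one timeline. If any service drops the header, the next hop starts a new, unrelated trace.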

Next, run an OpenTelemetry Collector to receive that data and route it to your backends. The Collector config declares three kinds of components (receivers, processors, exporters) and then wires them together into pipelines under the service section:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:   # listens on :4318
      grpc:   # listens on :4317
 
processors:
  batch: {}                 # batch spans for efficient export
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80    # protect the collector from OOM under load
 
exporters:
  otlphttp/traces:
    endpoint: http://tempo:4318      # trace backend
  prometheus:
    endpoint: 0.0.0.0:8889           # scrape target for metrics
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/traces]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

The Collector is the seam that makes your telemetry portable: point the exporters somewhere else and the whole fleet reroutes without a single app redeploy.
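To give a feel for what a batching stage in such a pipeline does, here is a simplified sketch: buffer items and hand them to the exporter in groups rather than one at a time. `Batcher` is a hypothetical class written for this example, not Collector code, and it flushes on size only, with no timer.

```typescript
type FlushFn<T> = (batch: T[]) => void;

// Hypothetical size-based batcher; a simplification with no timeout-based flush.
class Batcher<T> {
  private buffer: T[] = [];
  private readonly maxBatchSize: number;
  private readonly flushFn: FlushFn<T>;

  constructor(maxBatchSize: number, flushFn: FlushFn<T>) {
    this.maxBatchSize = maxBatchSize;
    this.flushFn = flushFn;
  }

  // Buffer one item; export the buffer once it reaches the size limit.
  add(item: T): void {
    this.buffer.push(item);
    if (this.buffer.length >= this.maxBatchSize) this.flush();
  }

  // Export whatever is buffered (call this on shutdown so nothing is lost).
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.flushFn(batch);
  }
}

// Record each "export call" so the batching is visible.
const exportCalls: string[][] = [];
const batcher = new Batcher<string>(3, (batch) => {
  exportCalls.push(batch);
});

for (const spanId of ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]) {
  batcher.add(spanId);
}
batcher.flush(); // drain the remainder

console.log(exportCalls.map((b) => b.length)); // [ 3, 3, 1 ]
```

Seven spans become three export calls instead of seven. Fewer, larger requests are the reason a batching step usually sits just before the exporters.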

Best practices

Instrument with OpenTelemetry, not a vendor SDK. Vendor-neutral instrumentation is the one decision that's expensive to reverse. Do it first.
Track the four golden signals from Google's SRE book — latency, traffic, errors, saturation — for every service. They catch most problems before users do.
Use percentiles, not averages. A healthy-looking average, or even a p50 of 100ms, can hide a p99 of 4s. Averages blur the tail; high percentiles tell you about the users having a bad time.
Correlate the three pillars. Put a trace_id in your structured logs so a trace links to the exact log lines, and attach exemplars to metrics so a spike links to a sample trace. Correlation is what turns three data sources into one investigation.
Sample traces intelligently. Keep 100% of errors and slow requests and a small fraction of the rest. This requires tail-based sampling, where the keep-or-drop decision is made after the trace completes. You don't need every healthy request stored forever.
Emit structured (JSON) logs. Grep-able free text doesn't scale; queryable fields do.
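Two of these practices are easy to check with a few lines of arithmetic. The sketch below uses made-up latency data (98 requests at 100ms, 2 at 4s) to compare the mean with nearest-rank percentiles, then shows a tail-sampling decision as a pure function. `percentile`, `keepTrace`, and the thresholds are hypothetical names and values chosen for the example.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Assumed data: 98 fast requests and 2 very slow ones.
const latencies: number[] = [];
for (let i = 0; i < 98; i++) latencies.push(100);
latencies.push(4000, 4000);

const mean = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;
const p50 = percentile(latencies, 50);
const p99 = percentile(latencies, 99);
console.log(mean, p50, p99); // 178 100 4000

// Tail-based sampling: decide AFTER the trace has finished.
interface FinishedTrace {
  traceId: string;
  hadError: boolean;
  durationMs: number;
}

// `roll` is a number in [0, 1), e.g. Math.random(); passed in to keep this testable.
function keepTrace(t: FinishedTrace, slowMs: number, baselineRate: number, roll: number): boolean {
  if (t.hadError) return true;             // always keep errors
  if (t.durationMs >= slowMs) return true; // always keep slow requests
  return roll < baselineRate;              // keep a small share of the rest
}

const healthy: FinishedTrace = { traceId: "t1", hadError: false, durationMs: 90 };
const slow: FinishedTrace = { traceId: "t2", hadError: false, durationMs: 4000 };
console.log(keepTrace(healthy, 1000, 0.05, 0.5), keepTrace(slow, 1000, 0.05, 0.5)); // false true
```

The mean of 178ms and the p50 of 100ms both look healthy, while the p99 exposes the 4s requests. The sampler needs `hadError` and `durationMs`, which only exist once the trace is complete, and that is why this decision cannot be made at the first span.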

Common mistakes

High-cardinality metric labels. Putting user_id or request_id on a metric can create millions of time series and take down your metrics store. That data belongs in logs and traces.
Logging everything at INFO. Log volume becomes cost and noise. Log decisions and errors, not every loop iteration.
Dashboards nobody reads. A wall of 60 panels is not observability. Build a few golden-signal dashboards and alert on symptoms users feel, not on every internal metric.
Alerting on causes instead of symptoms. Page on "checkout error rate > 1%," not "CPU > 80%." High CPU is often fine; a broken checkout never is.
No trace context propagation. If services don't forward the trace headers, your "distributed" traces stop at the first hop and you've lost the one thing traces are for.

Key takeaways

Observability = inferring internal state from metrics, logs, and traces.
Metrics = is it broken; traces = where; logs = why.
Monitoring answers known questions; observability lets you answer new ones after the fact.
OpenTelemetry decouples instrumentation from backends — instrument once, swap vendors freely.
Correlate the pillars with a shared trace_id, and alert on user-facing symptoms.

FAQ

What is observability in simple terms? It's how well you can understand what a running system is doing from the outside, using the data it emits — so you can diagnose problems you didn't predict without adding new code.

What are the three pillars of observability? Metrics (aggregated numbers over time), logs (discrete event records), and traces (the path of a single request across services). Some teams add a fourth pillar, continuous profiling.

Is observability the same as monitoring? No. Monitoring watches for known, predefined problems. Observability lets you ask new questions of your telemetry after an unexpected failure. Monitoring is one thing you do with an observable system.

Why use OpenTelemetry? It's the vendor-neutral standard, hosted by the CNCF, for generating and collecting telemetry. You instrument your code once and can send data to any compatible backend (Prometheus, Jaeger, Tempo, or a hosted vendor) without rewriting instrumentation.

Do I need all three pillars? For anything distributed, yes. Metrics alone tell you something's wrong but not where; traces show where without the detail; logs give detail without the system-wide view. Together they close the loop.

Conclusion

Observability isn't a product you buy — it's a property you build into a system by emitting the right telemetry and correlating it. Start with the four golden signals as metrics, add distributed tracing with OpenTelemetry so you can see across service boundaries, and keep structured logs for the detail. Instrument once with the open standard, and you keep the freedom to change everything behind it. From here, the observability cluster goes deeper into each signal — Prometheus and metric design, the OpenTelemetry Collector in production, and building dashboards in Grafana. Browse the full observability category and the broader DevOps guides to continue.

References

OpenTelemetry Documentation — concepts, SDKs, and the Collector.
Prometheus Documentation — metrics model and querying.
Google SRE Book: Monitoring Distributed Systems — the four golden signals.
Grafana Documentation — visualization and correlation across signals.
CNCF OpenTelemetry Project — governance and standard status.