Distributed Tracing in Node.js Microservices

Learn how to implement distributed tracing in Node.js microservices using OpenTelemetry. Discover traces, spans, context propagation, and production best practices.
The other day, while debugging a production issue that spanned five different Node.js microservices, I realized something frustrating: I had logs, I had metrics, but I had absolutely no idea which service was actually causing the cascade of failures. The logs showed errors in Service A, timeouts in Service B, and strange database queries in Service C. Were they related? Who knows!
This is where distributed tracing saved my sanity. I was once guilty of thinking "logs are enough" until I spent three hours following a request that bounced through seven services and died somewhere in the middle. Little did I know that distributed tracing would become one of my most valuable tools for understanding microservice behavior.
Why Distributed Tracing Is Critical for Node.js Microservices
When I finally decided to implement proper distributed tracing, the difference was night and day. Instead of piecing together logs from different services manually, I could see the entire journey of a request in one visualization.
Here's what distributed tracing gives you that logs alone cannot:
Complete request visibility. You see every service involved in handling a request, how long each took, and where failures occurred. When a user complains about slow checkout, you immediately know if it's the payment service, inventory service, or notification service causing the delay.
Performance bottlenecks become obvious. I once discovered that our "fast" cache lookup was taking 800ms because it was making seven database calls under the hood. The trace showed it clearly. Without tracing, I would have blamed the downstream services.
Dependency mapping happens automatically. Your traces reveal which services talk to which, what databases they hit, and what external APIs they call. This is wonderful for new team members trying to understand the system.
The ROI on learning distributed tracing is immediate. You'll debug production issues faster and catch performance problems before they become critical.
Understanding Traces, Spans, and Context Propagation
Before diving into code, let's clarify what these terms actually mean. I remember being confused by this vocabulary initially.
A trace represents one complete journey through your system. When a user clicks "Place Order," that generates one trace that follows the request through all your services.
A span represents one operation within that trace. Each service creates spans for the work it does. The order service creates a span for "process order," the payment service creates a span for "charge card," and so on. Spans can have child spans—the "charge card" span might have child spans for "validate card" and "send to payment gateway."
Context propagation is how services pass trace information to each other. When Service A calls Service B, it includes trace headers in the HTTP request. Service B reads these headers and creates its spans as children of Service A's span. This is what connects everything together.
In other words, without context propagation, you'd just have disconnected spans in different services. With it, you get a complete picture of the request flow.
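To make the mechanics concrete, here's a deliberately simplified sketch of what inject and extract do. This is plain JavaScript, not the OpenTelemetry API: the caller writes its trace ID and span ID into outgoing headers, and the callee reads them to parent its own span. Real implementations use the W3C traceparent header, but the idea is the same.

```javascript
// Simplified model of context propagation (NOT the real OpenTelemetry API).
let nextSpanId = 1;

function startSpan(name, parent) {
  return {
    name,
    spanId: String(nextSpanId++),
    traceId: parent ? parent.traceId : 'trace-1',
    parentSpanId: parent ? parent.spanId : null,
  };
}

// "Service A": create a span and inject its context into outgoing headers
function callServiceB(span) {
  const headers = {
    'x-trace-id': span.traceId,
    'x-parent-span-id': span.spanId,
  };
  return handleInServiceB(headers);
}

// "Service B": extract the context and parent its span on the caller's span
function handleInServiceB(headers) {
  return startSpan('serviceB.handle', {
    traceId: headers['x-trace-id'],
    spanId: headers['x-parent-span-id'],
  });
}

const rootSpan = startSpan('serviceA.request', null);
const childSpan = callServiceB(rootSpan);

console.log(childSpan.traceId === rootSpan.traceId);     // same trace
console.log(childSpan.parentSpanId === rootSpan.spanId); // spans are linked
```

Both spans share one trace ID, and the child records its parent's span ID, which is exactly how a tracing backend stitches spans from different services into one tree.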

Setting Up OpenTelemetry in a Node.js Service
OpenTelemetry has become the standard for distributed tracing, and luckily we can set it up in Node.js pretty easily. Here's how I set up a basic Express service with tracing:
```javascript
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false, // Disable noisy filesystem traces
      },
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
```

The critical part here is calling this before you import your application code. I made the mistake once of initializing OpenTelemetry after Express was already loaded, and nothing got instrumented. Your main file should look like this:
```javascript
// index.js
require('./tracing'); // MUST be first!

const express = require('express');
const app = express();

app.get('/orders/:id', async (req, res) => {
  // Your route logic here -- automatically traced by OpenTelemetry
  res.json({ orderId: req.params.id });
});

app.listen(3000);
```

This setup automatically instruments HTTP requests, database calls (if you're using supported libraries), and other common operations. Wonderful!
Auto-Instrumentation vs Manual Instrumentation: When to Use Each
Auto-instrumentation handles the boring stuff—HTTP servers, database queries, Redis calls. It works great for infrastructure-level tracing. When I first started with distributed tracing, I was fascinated by how much visibility I got without writing any code.
But auto-instrumentation doesn't know about your business logic. It can't tell you that "calculateShippingCost" took 2 seconds or that "validateInventory" failed for the third time.
That's where manual instrumentation comes in. Here's how I add custom spans for business operations:
```typescript
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(orderId: string, userId: string) {
  const span = tracer.startSpan('processOrder', {
    attributes: {
      'order.id': orderId,
      'user.id': userId,
    },
  });
  // Child spans must be started in a context that contains the parent span
  const ctx = trace.setSpan(context.active(), span);

  try {
    // Validate inventory
    const inventorySpan = tracer.startSpan('validateInventory', undefined, ctx);
    const items = await checkInventory(orderId);
    inventorySpan.setAttribute('inventory.items.count', items.length);
    inventorySpan.end();

    // Calculate shipping
    const shippingSpan = tracer.startSpan('calculateShipping', undefined, ctx);
    const shippingCost = await calculateShipping(items);
    shippingSpan.setAttribute('shipping.cost', shippingCost);
    shippingSpan.end();

    // Process payment
    const paymentSpan = tracer.startSpan('processPayment', undefined, ctx);
    await chargeCard(userId, shippingCost);
    paymentSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
    return { success: true };
  } catch (error) {
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}
```

I use auto-instrumentation for 80% of cases and manual instrumentation for critical business operations that need detailed tracking. The combination gives you both infrastructure visibility and business insight.
Tracing Across Service Boundaries: HTTP, Message Queues, and gRPC
Context propagation works automatically for HTTP when you use auto-instrumentation. But I came across situations where I needed to trace through message queues and gRPC calls.
For HTTP, OpenTelemetry automatically adds headers like traceparent and tracestate to outgoing requests. The receiving service reads these headers and continues the trace.
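For reference, the traceparent header defined by the W3C Trace Context spec has four dash-separated fields: version, trace ID (32 hex chars), parent span ID (16 hex chars), and flags. You'd never parse it yourself (the SDK handles this), but a tiny parser is a good way to see the format. The example header below is the one from the W3C spec:

```javascript
// Parse a W3C traceparent header: version-traceid-spanid-flags
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1, // bit 0 = sampled flag
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx.traceId); // '4bf92f3577b34da6a3ce929d0e0e4736'
console.log(ctx.sampled); // true
```

Note the sampled flag: it travels with the request, which is how downstream services honor an upstream sampling decision.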
For message queues (RabbitMQ, Kafka, etc.), you need to manually inject context into message headers:
```typescript
import { context, propagation, trace } from '@opentelemetry/api';
import * as amqp from 'amqplib';

const tracer = trace.getTracer('order-service');

async function publishOrderEvent(order: Order) {
  const span = tracer.startSpan('publishOrderEvent');
  try {
    // `connection` is an already-established amqplib connection
    const channel = await connection.createChannel();
    const queue = 'order-events';

    // Inject the publishing span's context into the message headers
    const headers: Record<string, any> = {};
    const ctx = trace.setSpan(context.active(), span);
    propagation.inject(ctx, headers);

    channel.sendToQueue(queue, Buffer.from(JSON.stringify(order)), {
      headers,
      persistent: true,
    });
    span.end();
  } catch (error) {
    span.recordException(error);
    span.end();
    throw error;
  }
}

async function consumeOrderEvent(msg: amqp.Message) {
  // Extract the publisher's tracing context from the message headers
  const parentContext = propagation.extract(
    context.active(),
    msg.properties.headers
  );
  const span = tracer.startSpan('consumeOrderEvent', undefined, parentContext);
  try {
    const order = JSON.parse(msg.content.toString());
    await processOrder(order);
    span.end();
  } catch (error) {
    span.recordException(error);
    span.end();
  }
}
```

I cannot stress this enough! Without manual propagation in message queues, your traces will be disconnected. You'll see the publisher's trace and the consumer's trace as separate, unrelated operations.

Adding Custom Spans and Attributes for Business Logic
The real power of distributed tracing comes from adding business context to your spans. Auto-instrumentation tells you "this HTTP request took 200ms." Custom attributes tell you "this HTTP request processed order #12345 for premium customer alice@example.com and charged $299.99."
When investigating production issues, I want to filter traces by customer type, order value, or payment method. Here's how I add meaningful attributes:
```typescript
async function chargeCard(userId: string, amount: number, paymentMethod: string) {
  const span = tracer.startSpan('chargeCard');

  // Add business attributes
  span.setAttributes({
    'payment.amount': amount,
    'payment.currency': 'USD',
    'payment.method': paymentMethod,
    'user.id': userId,
    'user.tier': await getUserTier(userId),
  });

  try {
    const result = await paymentGateway.charge({
      userId,
      amount,
      method: paymentMethod,
    });
    span.setAttribute('payment.transaction.id', result.transactionId);
    span.setAttribute('payment.status', result.status);
    return result;
  } catch (error) {
    span.recordException(error);
    span.setAttribute('payment.error.code', error.code);
    throw error;
  } finally {
    span.end();
  }
}
```

Now when payments fail, I can search for all failed transactions over $100, or all failures for premium users, or all failures with a specific payment method. This is wonderful for identifying patterns.
Visualizing Traces with Jaeger, Tempo, and Other Backends
OpenTelemetry just collects and exports traces. You need a backend to store and visualize them. I've used Jaeger and Grafana Tempo in production.
Jaeger is easier to set up locally. You can run it with Docker:
```shell
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```

Then point your OTLP exporter to http://localhost:4318/v1/traces and visit http://localhost:16686 to see the Jaeger UI.
Grafana Tempo scales better for production. It stores traces in object storage (S3, GCS) which is much cheaper than Jaeger's storage. The tradeoff is that Tempo has fewer search capabilities by default—you typically query it through Grafana.
In production, I use Tempo with Grafana for cost reasons. For local development, Jaeger's UI is faster and easier to work with.
Production Best Practices: Sampling, Performance, and Privacy
When I deployed distributed tracing to production for the first time, I made several mistakes. Here's what I learned:
Sampling is essential. Tracing every single request creates massive data volumes and costs. Head-based sampling decides at the start of a trace whether to record it; tail-based sampling decides after the trace completes, so it can keep every error and slow request plus a sample of successful ones. For low-traffic services, head-based sampling at 100% is fine; for high-traffic services, tail-based sampling gives you the interesting traces without the volume.
```javascript
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // Sample 10% of traces
  }),
  // ... other config
});
```

Performance overhead is real but manageable. OpenTelemetry adds about 5-10ms latency per request in my testing. For most services, this is acceptable. For ultra-low-latency services, use aggressive sampling.
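Tail-based sampling usually lives in a collector rather than in your service, but the decision logic itself is easy to sketch. Here's a hypothetical policy in plain JavaScript (not collector configuration) that keeps every error, every slow trace, and 10% of the rest:

```javascript
// Hypothetical tail-sampling policy: decide after the trace is complete.
// Thresholds and field names here are illustrative, not a real API.
function shouldKeepTrace(traceSummary, { slowMs = 1000, sampleRate = 0.1 } = {}) {
  if (traceSummary.hasError) return true;             // keep all errors
  if (traceSummary.durationMs >= slowMs) return true; // keep all slow traces
  return Math.random() < sampleRate;                  // sample the boring rest
}

console.log(shouldKeepTrace({ hasError: true, durationMs: 50 }));    // true
console.log(shouldKeepTrace({ hasError: false, durationMs: 2500 })); // true
```

In practice you'd express this as sampling policies in something like the OpenTelemetry Collector, but the shape of the decision is the same.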
Privacy matters. I was once guilty of putting user email addresses and credit card numbers (even masked ones) in span attributes. Don't do this! Traces might be stored for weeks or months. Only include non-sensitive identifiers like user IDs and order IDs.
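A cheap guardrail is to sanitize attributes before they ever reach a span. This is a hypothetical helper, not an OpenTelemetry API, and the blocked key names are examples:

```javascript
// Hypothetical attribute sanitizer: mask sensitive keys before passing
// the object to span.setAttributes(). Key names are illustrative.
const BLOCKED_KEYS = ['user.email', 'payment.card.number', 'auth.token'];

function sanitizeAttributes(attrs) {
  const safe = {};
  for (const [key, value] of Object.entries(attrs)) {
    safe[key] = BLOCKED_KEYS.includes(key) ? '[REDACTED]' : value;
  }
  return safe;
}

const attrs = sanitizeAttributes({
  'user.id': 'u_123',
  'user.email': 'alice@example.com',
  'payment.amount': 299.99,
});
console.log(attrs['user.email']); // '[REDACTED]'
console.log(attrs['user.id']);    // 'u_123'
```

Routing every setAttributes call through a helper like this means one reviewed list controls what can leak into your tracing backend.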
Attributes over events. I used to create span events for everything. Attributes are cheaper to query and store. Use events only for timestamped occurrences within a span, like "retry attempted" or "cache miss."
Set up alerts on trace errors. Your tracing backend can alert when error rates spike or latency increases. This gives you early warning before customers complain.
The beautiful thing about distributed tracing is that once it's set up correctly, it becomes your go-to tool for understanding system behavior. I check traces before logs now.
And that concludes this post! I hope you found it valuable, and look out for more in the future!