Modern applications need quantitative data to understand their behavior and performance. How many requests are we handling? How long do they take? Are we running out of resources? Metrics help answer these questions by providing numerical measurements of your service’s operation.
OpenTelemetry provides several metric instruments, each designed for specific measurement needs. Every instrument is defined by:
A name, such as http.requests.total
A unit, such as ms or bytes
Let’s explore each type of instrument:
Synchronous instruments are called directly in your code when something happens:
Counter: A value that only goes up, like an odometer in a car:
requestCounter, _ := meter.Int64Counter("http.requests.total",
	metric.WithDescription("Total number of HTTP requests"),
	metric.WithUnit("{requests}"))
// Usage: Increment when request received
requestCounter.Add(ctx, 1)
Perfect for:
UpDownCounter: A value that can increase or decrease, like items in a queue:
queueSize, _ := meter.Int64UpDownCounter("queue.items",
	metric.WithDescription("Current items in queue"),
	metric.WithUnit("{items}"))
// Usage: Add when enqueueing, subtract when dequeueing
queueSize.Add(ctx, 1)  // Item added
queueSize.Add(ctx, -1) // Item removed
Perfect for:
Histogram: Tracks the distribution of values, like request durations:
latency, _ := meter.Float64Histogram("http.request.duration",
	metric.WithDescription("HTTP request duration"),
	metric.WithUnit("ms"))
// Usage: Record value when request completes
// Float64Histogram records float64 values, so convert the int64 milliseconds
latency.Record(ctx, float64(time.Since(start).Milliseconds()))
Perfect for:
Asynchronous instruments are collected periodically by callbacks you register:
Asynchronous Counter: For values that only increase, but you only have access to the total:
bytesReceived, _ := meter.Int64ObservableCounter("network.bytes.received",
	metric.WithDescription("Total bytes received"),
	metric.WithUnit("By"))
// Usage: Register a callback that reports the current total on each collection
// (RegisterCallback returns a Registration and an error, ignored here for brevity)
_, _ = meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		o.ObserveInt64(bytesReceived, getNetworkStats().TotalBytesReceived)
		return nil
	}, bytesReceived)
Perfect for:
Asynchronous UpDownCounter: For values that can change either way, but you only see the current state:
goroutines, _ := meter.Int64ObservableUpDownCounter("system.goroutines",
	metric.WithDescription("Current number of goroutines"),
	metric.WithUnit("{goroutines}"))
// Usage: Register a callback that reports the current value on each collection
_, _ = meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		o.ObserveInt64(goroutines, int64(runtime.NumGoroutine()))
		return nil
	}, goroutines)
Perfect for:
Asynchronous Gauge: For current-value measurements that you periodically sample:
cpuUsage, _ := meter.Float64ObservableGauge("system.cpu.usage",
	metric.WithDescription("CPU usage percentage"),
	metric.WithUnit("1"))
// Usage: Register a callback that reports the current value on each collection
_, _ = meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		o.ObserveFloat64(cpuUsage, getCPUUsage())
		return nil
	}, cpuUsage)
Perfect for:
Ask yourself these questions:
Common use cases:
Clue automatically instruments several key metrics for your service. These give you immediate visibility without writing any code:
When you wrap your HTTP handlers with OpenTelemetry middleware:
handler = otelhttp.NewHandler(handler, "service")
You automatically get:
When you create a gRPC server with OpenTelemetry instrumentation:
server := grpc.NewServer(
	grpc.StatsHandler(otelgrpc.NewServerHandler()))
You automatically get:
While automatic metrics are helpful, you often need to track business-specific measurements. Here’s how to create and use custom metrics effectively:
First, get a meter for your service:
meter := otel.Meter("myservice")
Then create the metrics you need:
Counter Example: Track business events
orderCounter, _ := meter.Int64Counter("orders.total",
	metric.WithDescription("Total number of orders processed"),
	metric.WithUnit("{orders}"))
Histogram Example: Measure processing times
processingTime, _ := meter.Float64Histogram("order.processing_time",
	metric.WithDescription("Time taken to process orders"),
	metric.WithUnit("ms"))
UpDownCounter Example: Monitor queue depth
queueDepth, _ := meter.Int64UpDownCounter("orders.queue_depth",
	metric.WithDescription("Current number of orders in queue"),
	metric.WithUnit("{orders}"))
Let’s look at a complete example that demonstrates how to use different types of metrics in a real-world scenario. This example shows how to monitor an order processing system:
func processOrder(ctx context.Context, order *Order) error {
	// Track total orders (counter)
	// We increment the counter by 1 for each order, adding attributes for analysis.
	// Note: customer IDs are often high-cardinality; prefer a bounded attribute
	// such as a customer type when the number of customers is large.
	orderCounter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("type", order.Type),
		attribute.String("customer", order.CustomerID)))
	// Measure processing time (histogram)
	// We use a defer to ensure we always record the duration, even if the function returns early
	start := time.Now()
	defer func() {
		processingTime.Record(ctx,
			float64(time.Since(start).Milliseconds()),
			metric.WithAttributes(attribute.String("type", order.Type)))
	}()
	// Monitor queue depth (up-down counter)
	// We track the queue size by incrementing when adding and decrementing when done
	queueDepth.Add(ctx, 1)        // Increment when adding to queue
	defer queueDepth.Add(ctx, -1) // Decrement when done
	return processOrderInternal(ctx, order)
}
This example demonstrates several best practices:
Service Level Indicators are key metrics that help you understand your service’s health and performance. The four golden signals (Latency, Traffic, Errors, and Saturation) provide a comprehensive view of your service’s behavior. Let’s implement each one:
Latency measures how long it takes to serve requests. This example shows how to track request duration in an HTTP middleware:
// Create a histogram to track request durations
requestDuration, _ := meter.Float64Histogram("http.request.duration",
	metric.WithDescription("HTTP request duration"),
	metric.WithUnit("ms"))
// Middleware to measure request duration
func middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		// Record the duration with the request path as an attribute
		requestDuration.Record(r.Context(),
			float64(time.Since(start).Milliseconds()),
			metric.WithAttributes(attribute.String("path", r.URL.Path)))
	})
}
Traffic measures the demand on your system. This example counts HTTP requests:
// Create a counter for incoming requests
requestCount, _ := meter.Int64Counter("http.request.count",
	metric.WithDescription("Total HTTP requests"),
	metric.WithUnit("{requests}"))
// Middleware to count requests
func middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Increment the counter with method and path attributes
		requestCount.Add(r.Context(), 1, metric.WithAttributes(
			attribute.String("method", r.Method),
			attribute.String("path", r.URL.Path)))
		next.ServeHTTP(w, r)
	})
}
Error tracking helps identify issues in your service. This example counts HTTP 5xx errors:
// Create a counter for server errors
errorCount, _ := meter.Int64Counter("http.error.count",
	metric.WithDescription("Total HTTP errors"),
	metric.WithUnit("{errors}"))
// Middleware to track errors
func middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Use a custom ResponseWriter wrapper (statusWriter) to capture the status code
		sw := &statusWriter{ResponseWriter: w}
		next.ServeHTTP(sw, r)
		// Count 5xx errors
		if sw.status >= 500 {
			errorCount.Add(r.Context(), 1, metric.WithAttributes(
				attribute.Int("status_code", sw.status),
				attribute.String("path", r.URL.Path)))
		}
	})
}
Saturation measures how “full” your service is. This example monitors system resources:
// Create gauges for CPU and memory usage
cpuUsage, _ := meter.Float64ObservableGauge("system.cpu.usage",
	metric.WithDescription("CPU usage percentage"),
	metric.WithUnit("1"))
memoryUsage, _ := meter.Int64ObservableGauge("system.memory.usage",
	metric.WithDescription("Memory usage bytes"),
	metric.WithUnit("By"))
// Register a callback that the SDK invokes on every collection cycle.
// Observable gauges can only be observed from inside a callback,
// so no separate goroutine or ticker is needed.
_, _ = meter.RegisterCallback(
	func(ctx context.Context, o metric.Observer) error {
		// Update CPU usage
		o.ObserveFloat64(cpuUsage, getCPUUsage())
		// Update memory usage using runtime statistics
		var mem runtime.MemStats
		runtime.ReadMemStats(&mem)
		o.ObserveInt64(memoryUsage, int64(mem.Alloc))
		return nil
	}, cpuUsage, memoryUsage)
Once you’ve instrumented your code with metrics, you need to export them to a monitoring system. Here are examples of common exporters:
Prometheus is a popular choice for metrics collection. Here’s how to configure it:
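// Create a Prometheus exporter; it acts as a reader for the SDK meter provider
exporter, err := prometheus.New()
// Custom histogram boundaries are set with a view (sdkmetric is go.opentelemetry.io/otel/sdk/metric)
view := sdkmetric.NewView(
	sdkmetric.Instrument{Kind: sdkmetric.InstrumentKindHistogram},
	sdkmetric.Stream{Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
		Boundaries: []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}, // in milliseconds
	}})
provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter), sdkmetric.WithView(view))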
The histogram boundaries are crucial for accurate latency measurements. Choose boundaries that cover your expected latency range.
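To see why the boundaries matter, it can help to look at how values are assigned to buckets. The sketch below assumes upper-inclusive buckets, where bucket i covers values greater than the previous boundary and up to and including boundary i, plus one overflow bucket. The function names and sample latencies are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// bucketIndex returns the bucket that value falls into for the given sorted
// upper boundaries. Index len(bounds) is the overflow bucket for values
// above the last boundary.
func bucketIndex(bounds []float64, value float64) int {
	return sort.SearchFloat64s(bounds, value)
}

// countBuckets tallies values into len(bounds)+1 buckets.
func countBuckets(bounds, values []float64) []int {
	counts := make([]int, len(bounds)+1)
	for _, v := range values {
		counts[bucketIndex(bounds, v)]++
	}
	return counts
}

func main() {
	bounds := []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000} // milliseconds
	latencies := []float64{0.4, 7, 10, 180, 2500, 3100}
	fmt.Println(countBuckets(bounds, latencies))
	// prints [1 0 0 2 0 0 0 1 0 0 2]
}
```

The 2500 ms and 3100 ms samples both land in the overflow bucket and become indistinguishable, which is what happens when the boundaries stop short of the latencies a service actually produces.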
OTLP is the native protocol for OpenTelemetry. Use it to send metrics to collectors:
// Create an OTLP exporter connecting to a collector
// insecure.NewCredentials() disables transport security; use it for local development only
exporter, err := otlpmetricgrpc.New(ctx,
	otlpmetricgrpc.WithEndpoint("collector:4317"),
	otlpmetricgrpc.WithTLSCredentials(insecure.NewCredentials()))
Remember to configure TLS appropriately in production environments.
Follow a consistent pattern to make metrics discoverable and understandable:
<namespace>.<type>.<name>
For example:
http.request.duration - HTTP request latency
database.connection.count - Number of DB connections
order.processing.time - Order processing duration
The pattern helps users find and understand metrics without referring to documentation.
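A naming pattern is easier to keep consistent when it is checked mechanically. The sketch below is one hypothetical way to test names against the three-part pattern; the regular expression and the followsConvention helper are assumptions made for illustration, not rules defined by OpenTelemetry.

```go
package main

import (
	"fmt"
	"regexp"
)

// namePattern accepts three or more dot-separated segments. Each segment
// starts with a lowercase letter and may contain lowercase letters, digits
// and underscores. This rule is an assumption made for this sketch.
var namePattern = regexp.MustCompile(`^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2,}$`)

// followsConvention reports whether name matches the pattern above.
func followsConvention(name string) bool {
	return namePattern.MatchString(name)
}

func main() {
	names := []string{
		"http.request.duration",
		"database.connection.count",
		"RequestDuration",
		"order.time",
	}
	for _, name := range names {
		fmt.Printf("%-28s %v\n", name, followsConvention(name))
	}
}
```

A check like this could run in a unit test over the list of metric names a service registers.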
Always specify a unit for every metric to avoid ambiguity:
ms (milliseconds), s (seconds)
By (bytes)
{requests}, {errors}
1 (dimensionless)
Using consistent units makes metrics comparable and prevents conversion errors.
Consider these factors to maintain good performance:
Collection intervals: Choose appropriate intervals based on metric volatility
Batch updates: Group metric updates when possible
// Instead of this:
counter.Add(ctx, 1)
counter.Add(ctx, 1)
// Do this:
counter.Add(ctx, 2)
Cardinality growth: Monitor the number of unique time series
Aggregation: Pre-aggregate high-volume metrics
// Instead of recording every request:
histogram.Record(ctx, duration)
// Batch and record summaries:
type window struct {
count int64
sum float64
}
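The window struct above can be fleshed out into a small accumulator. This is a minimal sketch assuming a count and a sum are enough for the use case; the Observe and Flush method names are hypothetical, and how the flushed summary is exported is left open.

```go
package main

import (
	"fmt"
	"sync"
)

// window accumulates a count and a sum of observed durations between flushes.
type window struct {
	mu    sync.Mutex
	count int64
	sum   float64
}

// Observe adds one duration (in milliseconds) to the current window.
func (w *window) Observe(durationMs float64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.count++
	w.sum += durationMs
}

// Flush returns the accumulated count and sum, then resets the window.
func (w *window) Flush() (count int64, sum float64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	count, sum = w.count, w.sum
	w.count, w.sum = 0, 0
	return count, sum
}

func main() {
	var w window
	for _, d := range []float64{12, 8, 40} {
		w.Observe(d)
	}
	count, sum := w.Flush()
	fmt.Println(count, sum) // prints 3 60: one summary instead of three recordings
}
```

The trade-off is that a count and a sum only give an average. The distribution a histogram would capture, including percentiles, is lost, so this fits best where volume is very high and averages are sufficient.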
Document each metric thoroughly to help users understand and use them effectively:
Required documentation:
Example documentation:
// http.request.duration measures the time taken to process HTTP requests.
// Unit: milliseconds
// Attributes:
//   - method: HTTP method (GET, POST, etc.)
//   - path: Request path
//   - status_code: HTTP status code
// Update frequency: Per request
// Retention: 30 days
requestDuration, _ := meter.Float64Histogram(
	"http.request.duration",
	metric.WithDescription("Time taken to process HTTP requests"),
	metric.WithUnit("ms"))
For more detailed information about metrics:
OpenTelemetry Metrics: The official guide to OpenTelemetry metrics concepts and implementation.
Metric Semantic Conventions: Standard names and attributes for common metrics.
Prometheus Best Practices: Excellent guidance on metric naming and labels.
Four Golden Signals: Google’s guide to essential service metrics.
These resources provide deeper insights into metric implementation and best practices.
Attributes provide context to your metrics, making them more useful for analysis. However, choosing the right attributes requires careful consideration to avoid performance issues and maintain data quality.
Good attributes to include:
customer_type, order_status, error_code: These attributes have a limited set of possible values and provide meaningful grouping.
subscription_tier, payment_method: These help correlate metrics with business outcomes.
region, datacenter, instance_type: These enable operational analysis and troubleshooting.
Attributes to avoid:
user_id, order_id (use these in traces instead): These create too many unique time series and can overwhelm your metrics storage.
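One way to keep attribute values bounded is to normalize them before they reach a metric. The helpers below are a hypothetical sketch: statusClass collapses HTTP status codes into a handful of classes, and bounded maps any value outside an allow-list to "other". Neither is part of OpenTelemetry or Clue.

```go
package main

import "fmt"

// statusClass collapses an HTTP status code into a class such as "5xx",
// so the attribute takes at most six distinct values.
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// bounded returns value if it is in the allowed set and "other" otherwise,
// so the attribute never takes more than len(allowed)+1 distinct values.
func bounded(value string, allowed map[string]bool) string {
	if allowed[value] {
		return value
	}
	return "other"
}

func main() {
	paymentMethods := map[string]bool{"card": true, "bank_transfer": true}
	fmt.Println(statusClass(503))                             // 5xx
	fmt.Println(bounded("card", paymentMethods))              // card
	fmt.Println(bounded("gift-voucher-8841", paymentMethods)) // other
}
```

The normalized string can then be used as the attribute value, while the raw identifier goes on the trace span instead.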