OpenTelemetry metrics is approaching general availability, and you’ll want to understand this signal to implement it as part of your team’s observability strategy. Currently, you can collect some application runtime metrics out of the box with several language SDKs, and you can use the host metrics receiver to generate metrics about your host system.
If you want to generate and collect metrics beyond those, you’ll need to learn about metric instruments, types, and their use cases, including what you need to consider when choosing one. For example, you might want to know the number of active users of your application so you can better understand customer behavior.
To help frame these concepts, you can follow along with examples in our fork of the OpenTelemetry Community Demo application, which is an online shop selling a range of tools for stargazing. If you're familiar with general metrics concepts, feel free to move ahead to the metric instruments section.
What is a metric?
Before we get into OpenTelemetry metrics, let’s look at metrics in general. A metric is simply a measurement of a service that is captured at runtime. You can aggregate these measurements further to identify trends and patterns over time. Here are standard application and resource utilization metrics that are generally important when you’re developing apps:
- Throughput
- Response time/latency
- Error rate
- CPU utilization
- Memory utilization
Getting a little more complex, you might be interested in custom metrics to better understand your app and user behavior, or to track specific key performance indicators (KPIs). For example, if you’re the owner of the astronomy online store, here are some example custom metrics you might want to collect:
- Total number of checkouts
- Latency of search results being returned
- The distribution of orders by size
- Number of abandoned shopping carts per day
Why are metrics useful?
More specifically, you might be wondering why metrics are useful for observability in general, and what characteristics make them more useful than logs or traces.
All three signals—metrics, logs, and traces—are useful for monitoring the overall health and performance of your application. You can use them for span data or, more commonly, powering data visualizations.
But here’s where metrics really shine:
- Data volume reduction: Exporting and analyzing measurements individually can be expensive. By aggregating measurements, you can reduce your overall data volume while still gaining insight from the data.
- Alerts: Metrics form the basis of service level indicators (SLIs), which measure the performance of an application. You use the indicators to set service level objectives (SLOs) that teams use to calculate their error budgets. Metrics make a big difference if you use them to create alerts for breached SLOs.
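To make the SLI, SLO, and error budget relationship a little more concrete, here's a minimal sketch of the arithmetic. The function name, the 99.9% target, and the request counts are all made up for illustration, and a real alerting setup would evaluate this over a rolling time window rather than a single batch.

```python
from fractions import Fraction


def error_budget(total_requests, failed_requests, slo_target):
    """Compare a measured SLI against an SLO target.

    Returns (sli, allowed_failures, remaining_failures, breached).
    """
    # SLI: the fraction of requests that succeeded.
    sli = Fraction(total_requests - failed_requests, total_requests)
    # Error budget: the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    remaining_failures = allowed_failures - failed_requests
    breached = sli < slo_target
    return sli, allowed_failures, remaining_failures, breached


# Hypothetical target: 99.9% of checkout requests succeed.
SLO_TARGET = Fraction(999, 1000)

sli, allowed, remaining, breached = error_budget(10_000, 4, SLO_TARGET)
if breached:
    print("ALERT: checkout SLO breached")
else:
    print(f"OK: {remaining} failures left in the error budget")
```

Exact fractions are used here only to keep the arithmetic free of floating-point rounding; that's a choice made for this sketch.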
Overview of metrics concepts
To help you understand the metric instruments available with OpenTelemetry, let’s review six mathematical concepts at a high level.
Aggregation
Aggregation is the process of combining multiple measurements into one metric point. For example, let's say you have a set of 30 measurements, each representing a daily total number of telescopes sold. You could aggregate these totals to produce a single number, which would tell you how many of those telescopes you sold in the given time period (30 days).
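Here's a small sketch of that telescope example in plain Python. The `MetricPoint` class, the metric name, and the daily sales figures are hypothetical; they only illustrate many measurements collapsing into one point.

```python
from dataclasses import dataclass


@dataclass
class MetricPoint:
    """One aggregated data point (a hypothetical shape for this example)."""
    name: str
    value: int
    measurement_count: int


def aggregate_sum(name, measurements):
    """Combine many measurements into a single metric point by summing."""
    return MetricPoint(
        name=name,
        value=sum(measurements),
        measurement_count=len(measurements),
    )


# 30 made-up daily totals of telescopes sold.
daily_telescopes_sold = [3, 5, 2, 4, 6, 1, 0, 7, 3, 4] * 3

point = aggregate_sum("telescopes.sold", daily_telescopes_sold)
print(point)
```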
Temporality
The notion of temporality dictates how you aggregate. It relates to whether the reported values of additive quantities—values that are summed together—incorporate previous measurements or not. There are two types of temporality:
- Cumulative temporality indicates that each exported value includes all measurements recorded since a fixed start time. Another way to look at cumulative temporality is that the start time is always the same. If your application restarts, the value would reset to 0 and the start time would begin from the time of the application restart.
- Delta temporality indicates that measurements are reset each time they’re exported, which means you’re seeing the change in a measurement instead of the absolute value. Another way to look at delta temporality is that it has a constantly moving start time.
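The two temporalities describe the same underlying measurements, so you can convert between them. The sketch below assumes a simple list of exported values and uses a drop in the cumulative value as a (simplified) signal that the application restarted; the function names and numbers are made up for illustration.

```python
from itertools import accumulate


def to_cumulative(per_interval_values):
    """Delta -> cumulative: each point includes everything since the start."""
    return list(accumulate(per_interval_values))


def cumulative_to_delta(cumulative_values):
    """Cumulative -> delta: each point is the change since the last export."""
    deltas = []
    previous = 0
    for value in cumulative_values:
        if value < previous:
            # Assumption: a smaller value means the counter restarted from 0.
            deltas.append(value)
        else:
            deltas.append(value - previous)
        previous = value
    return deltas


# Telescopes sold in four consecutive export intervals (delta temporality).
sold_per_interval = [3, 5, 2, 4]

cumulative_series = to_cumulative(sold_per_interval)      # [3, 8, 10, 14]
round_trip = cumulative_to_delta(cumulative_series)       # [3, 5, 2, 4]

# The same sales, but the app restarted before the third export.
after_restart = cumulative_to_delta([3, 8, 2, 6])         # [3, 5, 2, 4]
```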
Monotonicity
There are two kinds of values related to monotonicity:
- Monotonic refers to a value that never decreases: it only goes up or stays the same. For example, your total number of telescopes sold over time is monotonic.
- Non-monotonic refers to a value that can both increase and decrease over time. For example, the number of telescopes sold from day to day will likely fluctuate, so the value is non-monotonic. (Although as a business owner, you’d certainly like for this to be a monotonic sum!)
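As a quick illustration, here's a sketch that checks both telescope examples. The `is_monotonic` helper and the sales figures are made up for this example.

```python
from itertools import accumulate


def is_monotonic(values):
    """True if no value is smaller than the one before it."""
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


# Made-up number of telescopes sold on each of four days.
daily_sales = [3, 5, 2, 4]

# Running total of telescopes sold over time: [3, 8, 10, 14].
total_sold = list(accumulate(daily_sales))

print(is_monotonic(total_sold))   # the running total never goes down
print(is_monotonic(daily_sales))  # day-to-day sales fluctuate
```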
Now, let’s take a look at the metric types that result from aggregation:
Sum
A sum is an addition of values. A sum can have a temporality of either:
- Cumulative (It doesn't reset between exports.)
- Delta (It resets after each export, bringing the state back to 0.)
Histogram
A histogram is a distribution of data consisting of buckets and counts of instances within those buckets. In OpenTelemetry, the term histogram refers to both an instrument type as well as an aggregation, and there are two types of histograms that are supported:
- Explicit bucket histograms have buckets that are explicitly defined during initialization.
- Exponential histograms also have buckets and bucket counts, but the bucket boundaries are computed based on an exponential scale. Learn more at OpenTelemetry exponential histograms.
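To show what "buckets and counts" means in practice, here's a minimal sketch of an explicit bucket histogram in plain Python. The class, the boundaries, and the latency values are hypothetical, and this is not the OpenTelemetry SDK implementation. It treats each boundary as an inclusive upper bound, which is my understanding of how the OpenTelemetry specification defines explicit buckets.

```python
from bisect import bisect_left


class ExplicitBucketHistogram:
    """Counts measurements into buckets defined up front."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One bucket per boundary, plus one overflow bucket at the end.
        self.counts = [0] * (len(self.boundaries) + 1)
        self.sum = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left finds the first boundary >= value, so a value equal
        # to a boundary lands in the bucket that boundary closes.
        self.counts[bisect_left(self.boundaries, value)] += 1
        self.sum += value
        self.count += 1


# Hypothetical search latencies in milliseconds.
search_latency = ExplicitBucketHistogram(boundaries=[100, 250, 500])
for latency_ms in [42, 100, 180, 260, 900]:
    search_latency.record(latency_ms)

# Buckets: (-inf, 100], (100, 250], (250, 500], (500, +inf)
print(search_latency.counts)
```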
Last value
A last value aggregation keeps only the most recent measurement and discards the ones before it. Temporality does not matter here: since you're always just sending the last value, it doesn’t matter if you reset the state or not.
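Here's a small sketch of why resetting makes no difference. The `LastValue` class and the readings are invented for illustration, and the sketch assumes at least one measurement arrives in every export interval.

```python
class LastValue:
    """Keeps only the most recent measurement."""

    def __init__(self):
        self.value = None

    def record(self, value):
        self.value = value

    def collect(self, reset):
        current = self.value
        if reset:
            self.value = None
        return current


def export_series(readings_per_interval, reset):
    """Record each interval's readings, then export once per interval."""
    aggregator = LastValue()
    exported = []
    for readings in readings_per_interval:
        for reading in readings:
            aggregator.record(reading)
        exported.append(aggregator.collect(reset=reset))
    return exported


# Made-up CPU utilization readings, grouped by export interval.
intervals = [[41.0, 43.5], [44.0], [39.5, 40.0, 42.0]]

with_reset = export_series(intervals, reset=True)
without_reset = export_series(intervals, reset=False)
```

Both runs export the final reading of each interval, so the two series come out identical.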
If you're looking for a more in-depth guide to some of these concepts, see Understand and query high cardinality metrics in the New Relic documentation.
Why use OpenTelemetry for metrics
I’m going to answer this question by talking about the design goals of OpenTelemetry:
- To provide the ability to correlate metrics to other signals. For example, you can correlate metrics to traces via exemplars, and enrich metrics attributes with baggage and context.
- To provide a path for OpenCensus users to migrate to OpenTelemetry. This was part of the original goal when OpenCensus and OpenTracing were merged to create OpenTelemetry back in 2019.
- To work with existing metrics instrumentation protocols and standards, with the minimum goal being to provide full support for Prometheus and StatsD.
The biggest benefit is that OpenTelemetry grants you freedom from vendor lock-in. You can instrument your applications once, and then send your telemetry to the backends of your choice.
Metric instruments, types, and use cases
You use an instrument to report measurements. In OpenTelemetry, there are six metric instruments, and each has an aggregation strategy (also called an aggregation) that reflects the intended use of the measurements it reports. The instrument type you select determines how the measurements are aggregated, and ultimately the type of metric that is exported, which affects the way you can query and analyze it.
So how do you choose the right metric instrument? Let’s look at it from another angle: different aggregations support different modes of analysis. For example, maybe you want to analyze the latency of search results being returned when your customers are searching for a product on your site. You need the measurements in a form that gives you insight. In this case, a sum of these measurements doesn't make sense, because the total latency across all searches tells you nothing about how any individual search performed. A histogram shows you the distribution of search response times, so you’d want to select an instrument that produces a histogram.
Here's a brief framework for how to select an instrument:
- How do you want to analyze the data?
- Does the measurement need to be done synchronously?
- When you use a synchronous instrument, you call the instrument in your code at the moment the event that you're measuring occurs.
- In contrast, an asynchronous instrument doesn't record each event. Instead, it reports a measurement once per collection interval, typically through a callback that you register.
- Whether to use one or the other boils down to convenience: Is it easier for you to access the data at the point of instrumentation, or would you rather have it reported on a specified interval?
- Are the values that the instrument records monotonic?
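The synchronous versus asynchronous question can be sketched in plain Python as follows. These classes are simplified stand-ins I made up for this example, not the OpenTelemetry API: the synchronous counter is called from the code path where the event happens, while the asynchronous gauge only holds a callback that runs when a collection cycle asks for a value.

```python
class SyncCounter:
    """Stand-in for a synchronous instrument: you call it when events occur."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


class AsyncGauge:
    """Stand-in for an asynchronous instrument: it calls you at collection."""

    def __init__(self, callback):
        self._callback = callback

    def observe(self):
        return self._callback()


# Hypothetical application state: items in each customer's cart.
cart_sizes = {"alice": 2, "bob": 0, "carol": 5}

checkouts = SyncCounter()
active_carts = AsyncGauge(
    lambda: sum(1 for size in cart_sizes.values() if size > 0)
)


def handle_checkout():
    # Synchronous: recorded right where the checkout happens.
    checkouts.add(1)


def collect():
    # Asynchronous: the gauge callback runs only when we collect.
    return {"checkouts": checkouts.value, "active_carts": active_carts.observe()}


handle_checkout()
handle_checkout()
first_snapshot = collect()
```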
To help you decide which instrument type to use, take a look at this table, which includes properties and examples for each instrument:
| Instrument | Synchronous | Additive | Monotonic | Default aggregation | Example measurements | Use when… |
| --- | --- | --- | --- | --- | --- | --- |
| Counter | ✅ | ✅ | ✅ | Sum | | |
| Up down counter | ✅ | ✅ | ❌ | Sum | | |
| Histogram | ✅ | ❌ | ❌ | Explicit bucket histogram | | |
| Async counter | ❌ | ✅ | ✅ | Sum | | |
| Async up down counter | ❌ | ✅ | ❌ | Sum | | |
| Gauge | ❌ | ❌ | ❌ | Last value | | |
Note: While the OpenTelemetry API provides a default aggregation for each instrument, you can override it using the Views API, which I won't detail here because this is just a 101.
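If it helps, the table's three yes/no properties can be read as a small decision procedure. The function below is a sketch that encodes only what the table states; the function name is made up, and the instrument names are the table's labels rather than identifiers from any SDK.

```python
# Default aggregation for each instrument, as listed in the table.
DEFAULT_AGGREGATION = {
    "Counter": "Sum",
    "Up down counter": "Sum",
    "Histogram": "Explicit bucket histogram",
    "Async counter": "Sum",
    "Async up down counter": "Sum",
    "Gauge": "Last value",
}


def choose_instrument(synchronous, additive, monotonic=False):
    """Pick an instrument from the table's three properties."""
    if not additive:
        # Non-additive rows in the table are also non-monotonic,
        # so the monotonic flag is ignored here.
        return "Histogram" if synchronous else "Gauge"
    if monotonic:
        return "Counter" if synchronous else "Async counter"
    return "Up down counter" if synchronous else "Async up down counter"


# Search latency: recorded per request, and summing it isn't meaningful.
latency_instrument = choose_instrument(synchronous=True, additive=False)
print(latency_instrument, "->", DEFAULT_AGGREGATION[latency_instrument])
```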
Considerations
Before you begin implementing metrics, there are a couple of things to take into consideration. Let’s start by reviewing the following two concepts in the context of observability:
Dimensions
A dimension refers to an attribute associated with the metrics. For example, if you’re using an instrument to count the number of customers in your telescope shop, you might also want to record information about the customers, such as their location. You'd add this information as a dimension on the measurement.
Dimensions are useful because you can use them to aggregate your data in different ways, as well as to filter on your data.
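Here's a sketch of both uses, grouping and filtering, in plain Python. The measurements, attribute keys, and helper names are invented for this example; each measurement is a count of 1 customer with a couple of dimensions attached.

```python
from collections import defaultdict

# Each measurement: (value, dimensions). All values here are made up.
customer_measurements = [
    (1, {"country": "US", "city": "Portland"}),
    (1, {"country": "US", "city": "Austin"}),
    (1, {"country": "DE", "city": "Berlin"}),
    (1, {"country": "US", "city": "Portland"}),
]


def sum_by(measurements, key):
    """Aggregate measurement values grouped by one dimension."""
    totals = defaultdict(int)
    for value, dimensions in measurements:
        totals[dimensions[key]] += value
    return dict(totals)


def filter_by(measurements, key, expected):
    """Keep only measurements whose dimension matches the expected value."""
    return [m for m in measurements if m[1][key] == expected]


customers_by_country = sum_by(customer_measurements, "country")
us_customers_by_city = sum_by(
    filter_by(customer_measurements, "country", "US"), "city"
)
```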
Cardinality
Metric cardinality refers to the number of unique values that a dimension takes on a metric. Using the example of capturing the locations of our customers, let’s say you're collecting the country for each customer.
If your customers happen to be from the same one or two countries, that would be low cardinality. If you collect their city instead, and they are from many different cities, that would result in high cardinality, because the number of unique values has increased. Then, imagine that your app is inundated with traffic because you ran a sale, and now you have customers from all over the world purchasing telescopes from your site. The number of unique values would increase sharply, and with it the load on the system. This is sometimes called a cardinality explosion.
When collecting telemetry, managing cardinality is typically a concern. One of the challenges of cardinality is increased storage cost. Additionally, some backends, including New Relic, impose cardinality limits, which might result in your data getting dropped. Read more in our documentation on how to understand and query high cardinality metrics.
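One rough way to reason about cardinality is to count the unique combinations of dimension values you'd be sending. The sketch below does that for a handful of made-up customers; the helper name and the limit of 4 are hypothetical and don't reflect any particular backend's real limits.

```python
# Made-up customer dimensions.
customers = [
    {"country": "US", "city": "Portland"},
    {"country": "US", "city": "Austin"},
    {"country": "US", "city": "Denver"},
    {"country": "DE", "city": "Berlin"},
    {"country": "DE", "city": "Munich"},
    {"country": "US", "city": "Portland"},
]


def cardinality(records, keys):
    """Count the unique combinations of values for the given dimensions."""
    return len({tuple(record[key] for key in keys) for record in records})


country_cardinality = cardinality(customers, ["country"])
city_cardinality = cardinality(customers, ["city"])

# Hypothetical limit, purely for illustration.
CARDINALITY_LIMIT = 4
for name, count in [("country", country_cardinality), ("city", city_cardinality)]:
    status = "over" if count > CARDINALITY_LIMIT else "within"
    print(f"{name}: {count} unique values ({status} the limit)")
```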
Next steps
Now that you’ve got a basic foundation in OpenTelemetry metrics and know how to choose an instrument, you’re ready to collect some metric data!
To learn how to implement application runtime metrics and create an instrument in your OpenTelemetry SDK, check out our manual instrumentation tutorial for Java in the New Relic docs. (More languages are in progress!)
To learn more about the power of exponential histograms, check out our blog post on OpenTelemetry exponential histograms.