Supplementary Guidelines

Note: this document is NOT a spec. It is provided to support the Metrics API and SDK specifications, and it does NOT add any extra requirements to the existing specifications.

Guidelines for instrumentation library authors

Instrument selection

The Instruments are part of the Metrics API. They allow Measurements to be recorded synchronously or asynchronously.

Choosing the correct instrument is important, because:

  • It helps the library achieve better efficiency. For example, if we want to report room temperature to Prometheus, we should consider using an Asynchronous Gauge rather than periodically polling the sensor, so that we only access the sensor when a scrape happens.
  • It makes consumption easier for the user of the library. For example, if we want to report HTTP server request latency, we should consider a Histogram, so most users can get a reasonable experience (e.g. default buckets, min/max) by simply enabling the metrics stream, rather than doing extra configuration.
  • It clarifies the semantics of the metrics stream, so consumers have a better understanding of the results. For example, if we want to report the process heap size, by using an Asynchronous UpDownCounter rather than an Asynchronous Gauge, we make it explicit that the consumer can add up the numbers across all processes to get the “total heap size”.

Here is one way of choosing the correct instrument:

  • I want to count something (by recording a delta value):
    • If the value is monotonically increasing (the delta value is always non-negative) - use a Counter.
    • If the value is NOT monotonically increasing (the delta value can be positive, negative or zero) - use an UpDownCounter.
  • I want to record or time something, and the statistics about this thing are likely to be meaningful - use a Histogram.
  • I want to measure something (by reporting an absolute value):
    • If it makes NO sense to add up the values across different dimensions, use an Asynchronous Gauge.
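    • If it makes sense to add up the values across different dimensions:
      • If the value is monotonically increasing - use an Asynchronous Counter.
      • If the value is NOT monotonically increasing - use an Asynchronous UpDownCounter.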
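
The decision list above can be sketched as a small helper. This is only an illustration: the function `choose_instrument` and its parameters are hypothetical and not part of the Metrics API. The last branch assumes that additive absolute values map to an Asynchronous Counter when monotonic and to an Asynchronous UpDownCounter otherwise, in line with the heap size and page fault examples in this document.

```python
def choose_instrument(reports_absolute_value: bool,
                      monotonic: bool = False,
                      additive: bool = True,
                      wants_statistics: bool = False) -> str:
    """Return the instrument name suggested by the decision list.

    Hypothetical helper for illustration only.
    """
    if wants_statistics:
        # Recording or timing something whose statistics are meaningful.
        return "Histogram"
    if not reports_absolute_value:
        # Counting something by recording delta values.
        return "Counter" if monotonic else "UpDownCounter"
    # Measuring something by reporting absolute values.
    if not additive:
        return "Asynchronous Gauge"
    return "Asynchronous Counter" if monotonic else "Asynchronous UpDownCounter"
```

For example, room temperature is an absolute value that is not additive, so `choose_instrument(True, additive=False)` suggests an Asynchronous Gauge.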

Additive property

Monotonicity property

In the OpenTelemetry Metrics Data Model and API specifications, the word monotonic has been used frequently.

It is important to understand that different Instruments handle monotonicity differently.

Let’s take an example with a network driver using a Counter to record the total number of bytes received:

  • During the time range (T0, T1]:
    • no network packet has been received
  • During the time range (T1, T2]:
    • received a packet with 30 bytes - Counter.Add(30)
    • received a packet with 200 bytes - Counter.Add(200)
    • received a packet with 50 bytes - Counter.Add(50)
  • During the time range (T2, T3]:
    • received a packet with 100 bytes - Counter.Add(100)

You can see that the total increment during (T0, T1] is 0, the total increment during (T1, T2] is 280 (30 + 200 + 50), the total increment during (T2, T3] is 100, and the total increment during (T0, T3] is 380 (0 + 280 + 100). All the increments are non-negative. In other words, the sum is monotonically increasing.
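
The arithmetic can be checked with a short sketch. The dictionary below simply restates the Counter.Add calls from the example; the variable names are illustrative.

```python
# Bytes recorded via Counter.Add(...) in each time range of the example.
increments = {
    "(T0, T1]": [],
    "(T1, T2]": [30, 200, 50],
    "(T2, T3]": [100],
}

# Total increment per time range, and over the whole range (T0, T3].
per_range = {time_range: sum(values) for time_range, values in increments.items()}
total = sum(per_range.values())

# Every recorded increment is non-negative, so the running sum never decreases.
all_non_negative = all(v >= 0 for values in increments.values() for v in values)

print(per_range)  # {'(T0, T1]': 0, '(T1, T2]': 280, '(T2, T3]': 100}
print(total)      # 380
```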

Note that it is inaccurate to say “the total bytes received by T3 is 380”, because there might be network packets received by the driver before we started to observe it (e.g. before the last operating system reboot). The accurate way is to say “the total bytes received during (T0, T3] is 380”. In a nutshell, the count represents a rate which is associated with a time range.

This monotonicity property is important because it gives the downstream systems additional hints so they can handle the data in a better way. Imagine we report the total number of bytes received in a cumulative sum data stream:

  • At Tn, we reported 3,896,473,820.
  • At Tn+1, we reported 4,294,967,293.
  • At Tn+2, we reported 1,800,372.

The backend system could tell that there was an integer overflow or a system restart during (Tn+1, Tn+2], so it has a chance to “fix” the data.
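
As a minimal sketch of how a backend might use this hint, the hypothetical function `find_resets` below flags every point in a monotonic cumulative stream that is lower than its predecessor. How a real backend then repairs the data is outside the scope of this sketch.

```python
def find_resets(points):
    """Return the indices where a monotonic cumulative stream decreased.

    A decrease cannot come from normal increments, so it hints at an
    integer overflow or a restart. Hypothetical helper for illustration.
    """
    return [i for i in range(1, len(points)) if points[i] < points[i - 1]]


# Values reported at Tn, Tn+1 and Tn+2 in the example above.
reported = [3_896_473_820, 4_294_967_293, 1_800_372]
print(find_resets(reported))  # [2]
```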

Let’s take another example with a process using an Asynchronous Counter to report the total page faults of the process:

The page faults are managed by the operating system, and the process could retrieve the number of page faults via some system APIs.

  • At T0:
    • the process started
    • the process didn’t ask the operating system to report the page faults
  • At T1:
    • the operating system reported 1000 page faults for the process
  • At T2:
    • the process didn’t ask the operating system to report the page faults
  • At T3:
    • the operating system reported 1050 page faults for the process
  • At T4:
    • the operating system reported 1200 page faults for the process

You can see that the number being reported is the absolute value rather than increments, and the value is monotonically increasing.

If we need to calculate “how many page faults occurred during (T3, T4]”, we need to subtract the two reported values: 1200 - 1050 = 150.
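
The same subtraction works for any pair of observations. The sketch below uses a plain dictionary and a hypothetical helper `increase`; it is not an SDK API.

```python
# Absolute page fault counts reported at each observation time in the example.
observations = {"T1": 1000, "T3": 1050, "T4": 1200}


def increase(observations, start, end):
    """Page faults that occurred during (start, end], by subtraction."""
    return observations[end] - observations[start]


print(increase(observations, "T3", "T4"))  # 150
print(increase(observations, "T1", "T4"))  # 200
```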

Semantic convention

Once you have decided which instrument(s) to use, you will need to decide the names for the instruments and attributes.

It is highly recommended that you align with the OpenTelemetry Semantic Conventions, rather than inventing your own semantics.

Guidelines for SDK authors

Aggregation temporality

The OpenTelemetry Metrics Data Model and SDK are designed to support both Cumulative and Delta Temporality. It is important to understand that temporality will impact how the SDK could manage memory usage. Let’s take the following HTTP requests example:

  • During the time range (T0, T1]:
    • verb = GET, status = 200, duration = 50 (ms)
    • verb = GET, status = 200, duration = 100 (ms)
    • verb = GET, status = 500, duration = 1 (ms)
  • During the time range (T1, T2]:
    • no HTTP request has been received
  • During the time range (T2, T3]:
    • verb = GET, status = 500, duration = 5 (ms)
    • verb = GET, status = 500, duration = 2 (ms)
  • During the time range (T3, T4]:
    • verb = GET, status = 200, duration = 100 (ms)
  • During the time range (T4, T5]:
    • verb = GET, status = 200, duration = 100 (ms)
    • verb = GET, status = 200, duration = 30 (ms)
    • verb = GET, status = 200, duration = 50 (ms)

Let’s imagine we export the metrics as a Histogram, and to simplify the story we will only have one histogram bucket (-Inf, +Inf):

If we export the metrics using Delta Temporality:

  • (T0, T1]
    • dimensions: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • dimensions: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T1, T2]
    • nothing since we don’t have any Measurement received
  • (T2, T3]
    • dimensions: {verb = GET, status = 500}, count: 2, min: 2 (ms), max: 5 (ms)
  • (T3, T4]
    • dimensions: {verb = GET, status = 200}, count: 1, min: 100 (ms), max: 100 (ms)
  • (T4, T5]
    • dimensions: {verb = GET, status = 200}, count: 3, min: 30 (ms), max: 100 (ms)

You can see that the SDK only needs to track what has happened after the latest collection/export cycle. For example, when the SDK started to process measurements in (T1, T2], it can completely forget about what has happened during (T0, T1].
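
A minimal sketch of this behavior, assuming a single-bucket histogram keyed by (verb, status). The class `DeltaHistogram` is hypothetical and is not how any particular SDK is implemented; it only shows that state can be dropped on every collection.

```python
class DeltaHistogram:
    """Single-bucket histogram that forgets its state on every collection."""

    def __init__(self):
        self._points = {}  # (verb, status) -> {"count", "min", "max"}

    def record(self, verb, status, duration_ms):
        point = self._points.get((verb, status))
        if point is None:
            self._points[(verb, status)] = {
                "count": 1, "min": duration_ms, "max": duration_ms}
        else:
            point["count"] += 1
            point["min"] = min(point["min"], duration_ms)
            point["max"] = max(point["max"], duration_ms)

    def collect(self):
        """Return the points for the interval that just ended, then reset."""
        points, self._points = self._points, {}
        return points


histogram = DeltaHistogram()
for status, duration_ms in [(200, 50), (200, 100), (500, 1)]:  # (T0, T1]
    histogram.record("GET", status, duration_ms)

first = histogram.collect()   # points for (T0, T1]
second = histogram.collect()  # (T1, T2]: no measurements, nothing retained
```

After the first `collect()`, the aggregator holds no state at all, which is why the second call returns an empty result.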

If we export the metrics using Cumulative Temporality:

  • (T0, T1]
    • dimensions: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • dimensions: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T0, T2]
    • dimensions: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • dimensions: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T0, T3]
    • dimensions: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • dimensions: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)
  • (T0, T4]
    • dimensions: {verb = GET, status = 200}, count: 3, min: 50 (ms), max: 100 (ms)
    • dimensions: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)
  • (T0, T5]
    • dimensions: {verb = GET, status = 200}, count: 6, min: 30 (ms), max: 100 (ms)
    • dimensions: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)

You can see that we are performing Delta->Cumulative conversion, and the SDK has to track what has happened prior to the latest collection/export cycle. In the worst case, the SDK will have to remember what has happened since the beginning of the process.
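
The conversion can be sketched as folding each delta export into a running state that is never cleared. The function `merge` and the data layout are assumptions made for illustration; the delta values are taken from the example above.

```python
def merge(cumulative, delta):
    """Fold one delta export into the running cumulative state (in place)."""
    for key, d in delta.items():
        c = cumulative.get(key)
        if c is None:
            cumulative[key] = dict(d)
        else:
            c["count"] += d["count"]
            c["min"] = min(c["min"], d["min"])
            c["max"] = max(c["max"], d["max"])
    return cumulative


# Delta exports for (T0, T1] through (T4, T5], keyed by (verb, status).
deltas = [
    {("GET", 200): {"count": 2, "min": 50, "max": 100},
     ("GET", 500): {"count": 1, "min": 1, "max": 1}},
    {},
    {("GET", 500): {"count": 2, "min": 2, "max": 5}},
    {("GET", 200): {"count": 1, "min": 100, "max": 100}},
    {("GET", 200): {"count": 3, "min": 30, "max": 100}},
]

state = {}
for delta in deltas:
    merge(state, delta)  # the state only grows; no key is ever forgotten

print(state[("GET", 200)])  # {'count': 6, 'min': 30, 'max': 100}
print(state[("GET", 500)])  # {'count': 3, 'min': 1, 'max': 5}
```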

Imagine if we have a long running service and we collect metrics with 7 dimensions and each dimension can have 30 different values. We might eventually end up having to remember the complete set of all 21,870,000,000 permutations! This cardinality explosion is a well-known challenge in the metrics space.
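
The figure follows directly from raising the number of values per dimension to the power of the number of dimensions:

```python
dimensions = 7
values_per_dimension = 30

# Each dimension independently takes one of 30 values.
combinations = values_per_dimension ** dimensions
print(f"{combinations:,}")  # 21,870,000,000
```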

To make things worse, if we export the permutations even if there are no recent updates, the export batch could become huge and very costly. For example, do we really need/want to export the same thing for (T0, T2] in the above case?

So here are some suggestions that we encourage SDK implementers to consider:

  • You want to control the memory usage rather than allow it to grow indefinitely / unbounded - regardless of what aggregation temporality is being used.
  • You want to improve the memory efficiency by being able to forget about things that are no longer needed.
  • You probably don’t want to keep exporting the same thing over and over again if there are no updates. You might want to consider Resets and Gaps. For example, if a Cumulative metrics stream hasn’t received any updates for a long period of time, would it be okay to reset the start time?

In the above case, we have Measurements reported by a Histogram Instrument. What if we collect measurements from an Asynchronous Counter?

The following example shows the number of page faults of each thread since the thread started:

  • During the time range (T0, T1]:
    • pid = 1001, tid = 1, #PF = 50
    • pid = 1001, tid = 2, #PF = 30
  • During the time range (T1, T2]:
    • pid = 1001, tid = 1, #PF = 53
    • pid = 1001, tid = 2, #PF = 38
  • During the time range (T2, T3]:
    • pid = 1001, tid = 1, #PF = 56
    • pid = 1001, tid = 2, #PF = 42
  • During the time range (T3, T4]:
    • pid = 1001, tid = 1, #PF = 60
    • pid = 1001, tid = 2, #PF = 47
  • During the time range (T4, T5]:
    • thread 1 died, thread 3 started
    • pid = 1001, tid = 2, #PF = 53
    • pid = 1001, tid = 3, #PF = 5

If we export the metrics using Cumulative Temporality:

  • (T0, T1]
    • dimensions: {pid = 1001, tid = 1}, sum: 50
    • dimensions: {pid = 1001, tid = 2}, sum: 30
  • (T0, T2]
    • dimensions: {pid = 1001, tid = 1}, sum: 53
    • dimensions: {pid = 1001, tid = 2}, sum: 38
  • (T0, T3]
    • dimensions: {pid = 1001, tid = 1}, sum: 56
    • dimensions: {pid = 1001, tid = 2}, sum: 42
  • (T0, T4]
    • dimensions: {pid = 1001, tid = 1}, sum: 60
    • dimensions: {pid = 1001, tid = 2}, sum: 47
  • (T0, T5]
    • dimensions: {pid = 1001, tid = 2}, sum: 53
    • dimensions: {pid = 1001, tid = 3}, sum: 5

It is quite straightforward - we just take the data being reported from the asynchronous instruments and send them. We might want to consider if Resets and Gaps should be used to denote the end of a metric stream - e.g. thread 1 died, the thread ID might be reused by the operating system, and we probably don’t want to confuse the metrics backend.

If we export the metrics using Delta Temporality:

  • (T0, T1]
    • dimensions: {pid = 1001, tid = 1}, delta: 50
    • dimensions: {pid = 1001, tid = 2}, delta: 30
  • (T1, T2]
    • dimensions: {pid = 1001, tid = 1}, delta: 3
    • dimensions: {pid = 1001, tid = 2}, delta: 8
  • (T2, T3]
    • dimensions: {pid = 1001, tid = 1}, delta: 3
    • dimensions: {pid = 1001, tid = 2}, delta: 4
  • (T3, T4]
    • dimensions: {pid = 1001, tid = 1}, delta: 4
    • dimensions: {pid = 1001, tid = 2}, delta: 5
  • (T4, T5]
    • dimensions: {pid = 1001, tid = 2}, delta: 6
    • dimensions: {pid = 1001, tid = 3}, delta: 5

You can see that we are performing Cumulative->Delta conversion, and it requires us to remember the last value of every single permutation we’ve encountered so far, because if we don’t, we won’t be able to calculate the delta value using current value - last value. And as you can tell, this is super expensive, since the remembered state grows with the number of permutations.
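
A minimal sketch of such a conversion, assuming each collection is a dictionary keyed by (pid, tid) and that a stream seen for the first time starts from 0. The class `CumulativeToDelta` is hypothetical and ignores resets and stream expiry.

```python
class CumulativeToDelta:
    """Turn absolute observations into deltas by remembering last values."""

    def __init__(self):
        self.last_values = {}  # (pid, tid) -> last observed absolute value

    def convert(self, observations):
        deltas = {}
        for key, value in observations.items():
            # A stream seen for the first time is assumed to start from 0.
            deltas[key] = value - self.last_values.get(key, 0)
            self.last_values[key] = value
        return deltas


# Observations for (T0, T1] through (T4, T5] from the example above.
collections = [
    {(1001, 1): 50, (1001, 2): 30},
    {(1001, 1): 53, (1001, 2): 38},
    {(1001, 1): 56, (1001, 2): 42},
    {(1001, 1): 60, (1001, 2): 47},
    {(1001, 2): 53, (1001, 3): 5},
]

converter = CumulativeToDelta()
exports = [converter.convert(observations) for observations in collections]

print(exports[1])                  # {(1001, 1): 3, (1001, 2): 8}
print(exports[4])                  # {(1001, 2): 6, (1001, 3): 5}
print(len(converter.last_values))  # 3
```

Note that the entry for thread 1 is still held after the thread died. Without some expiry policy, this sketch never releases memory for streams that have ended.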

Making it more interesting, if we have min/max values, it is mathematically impossible to reliably deduce the Delta min/max from the Cumulative min/max. For example:

  • If the maximum value is 10 during (T0, T2] and the maximum value is 20 during (T0, T3], we know that the maximum value during (T2, T3] must be 20.
  • If the maximum value is 20 during (T0, T2] and the maximum value is also 20 during (T0, T3], we wouldn’t know what the maximum value is during (T2, T3], unless we know that there is no value (count = 0).
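
The two cases can be expressed as a small function. The name `delta_max` is hypothetical, and `None` is used here to mean "cannot be determined from the cumulative maxima alone".

```python
from typing import Optional


def delta_max(previous_cumulative_max: float,
              current_cumulative_max: float) -> Optional[float]:
    """Try to recover the maximum of the latest interval.

    Returns the maximum when the cumulative maximum increased, because the
    new maximum must have been recorded in the latest interval. Returns
    None when the cumulative maximum is unchanged, because the latest
    interval could contain any values up to that maximum, or none at all.
    """
    if current_cumulative_max > previous_cumulative_max:
        return current_cumulative_max
    return None


print(delta_max(10, 20))  # 20
print(delta_max(20, 20))  # None
```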

So here are some suggestions that we encourage SDK implementers to consider:

  • You probably don’t want to encourage your users to do Cumulative->Delta conversion. Actually, you might want to discourage them from doing this.
  • If you have to do Cumulative->Delta conversion, and you encounter min/max, rather than drop the data on the floor, you might want to convert them to something useful - e.g. Gauge.

Memory management

Memory management is a wide topic. Here we will only cover some of the most important things for the OpenTelemetry SDK.

Choose a better design so the SDK has fewer things to remember, and avoid keeping things in memory unless it is really necessary. One good example is the aggregation temporality.

Design a better memory layout, so the storage is efficient and accessing the storage can be fast. This is normally specific to the target programming language and platform. For example, aligning the memory to the CPU cache line, keeping hot memory regions close to each other, keeping the memory close to the hardware (e.g. non-paged pool, NUMA).

Pre-allocate and pool the memory, so the SDK doesn’t have to allocate memory on-the-fly. This is especially useful to language runtimes that have garbage collectors, as it ensures the hot path in the code won’t trigger garbage collection.

Limit the memory usage, and handle critical memory conditions. The general expectation is that a telemetry SDK should not fail the application. This can be done via a dimension-capping algorithm - e.g. start to combine/drop some data points when the SDK hits the memory limit, and provide a mechanism to report the data loss.
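
One possible shape of such a dimension-capping algorithm is sketched below. The class `CappedSumStore`, its limit parameter and its overflow fields are all hypothetical; a real SDK would choose its own policy for what to combine or drop and how to report it.

```python
class CappedSumStore:
    """Sum aggregation that stops creating new series beyond a fixed limit."""

    def __init__(self, max_series):
        self.max_series = max_series
        self.sums = {}           # attribute key -> running sum
        self.overflow_sum = 0    # combined value of measurements over the cap
        self.overflow_count = 0  # number of measurements folded into overflow

    def add(self, key, value):
        if key in self.sums or len(self.sums) < self.max_series:
            self.sums[key] = self.sums.get(key, 0) + value
        else:
            # Memory limit reached: combine the measurement into a single
            # overflow bucket and keep a count so the loss can be reported.
            self.overflow_sum += value
            self.overflow_count += 1


store = CappedSumStore(max_series=2)
store.add("a", 1)
store.add("b", 2)
store.add("c", 3)  # third distinct key exceeds the cap
store.add("a", 4)  # existing series are still updated

print(store.sums)            # {'a': 5, 'b': 2}
print(store.overflow_sum)    # 3
print(store.overflow_count)  # 1
```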

Provide configurations to the application owner. The answer to “what is an efficient memory usage” ultimately depends on the goal of the application owner. For example, the application owners might want to spend more memory in order to keep more permutations of metrics dimensions, or they might want to use memory aggressively for certain dimensions that are important, and keep a conservative limit for dimensions that are less important.