Observability describes how much of a system can be understood from the signals it emits. It indicates how prepared you are to verify the system's behavior when an operation succeeds or, even more importantly, when something goes wrong. For more details about Observability concepts and how it compares to Monitoring, please check our previous article. This article presents OpenTelemetry and how it can be used while implementing observability.
OpenTelemetry – What Is It?
Whenever we talk about observability and start defining the practical aspects applied to our systems, we need to think about some challenges:
- Which signals should we produce and how to produce them in our applications?
- Where should we store these signals and how can we propagate them from the app to storage?
- How can we analyze and explore the signals to understand our application behavior?
When thinking about technologies that will help address these challenges, we can consider OpenTelemetry. According to the main project website:
“OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.”
The statement above is broad, but by the end of this article, its meaning and the potential application of OpenTelemetry will become clear.
Signals – The Information Produced by the Application
As part of its core API, OpenTelemetry defines three main signals: Logs, Metrics, and Traces. The API defines how each signal should be produced and which fields or parts compose it (i.e., the signal data model). A supplementary goal is to be able to correlate all the signals produced, so some fields exist specifically for this purpose.
Let’s dig a little more into Logs, Metrics, and Traces.
Traces represent the path each request takes through the system, which can touch several different components. Each “slice” of a Trace is called a Span, and Spans can relate to each other in multiple ways. For example, a Span can be created directly by another Span, as when an HTTP request is performed, but it can also be created when some condition is fulfilled, as when multiple parts of a task (each with its own Span) complete and allow another task to start (and hence a new Span is created). Traces are useful for understanding how components interact with each other and how much time is spent in each one. Figure 1, extracted from the OpenSearch Observability Plugin, shows a trace that spans multiple components:
Figure 1 – Example of a trace that spans multiple components (source: https://opensearch.org/docs/latest/observing-your-data/trace/index/)
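To make the relationship between a Trace and its Spans more concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK: the `Span` class, its field names, the helper functions, and the component names are all illustrative assumptions. The sketch only mirrors the idea that every Span carries the ID of its Trace and, optionally, the ID of the Span that created it.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Simplified, hypothetical span: one 'slice' of a trace."""
    name: str
    trace_id: str
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    duration_ms: float = 0.0


def child_of(parent: Span, name: str, duration_ms: float) -> Span:
    """Create a span in the same trace, directly caused by `parent`."""
    return Span(name, parent.trace_id, parent.span_id, duration_ms=duration_ms)


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return the trace as indented lines, children under their parent."""
    lines = []
    for s in spans:
        if s.parent_span_id == parent_id:
            lines.append(f"{'  ' * depth}{s.name} ({s.duration_ms} ms)")
            lines.extend(render(spans, s.span_id, depth + 1))
    return lines


# One request touching three hypothetical components.
root = Span("frontend: GET /checkout", trace_id=secrets.token_hex(16), duration_ms=120.0)
orders = child_of(root, "orders-service: create order", 80.0)
db = child_of(orders, "database: INSERT order", 35.0)
trace_spans = [root, orders, db]

print("\n".join(render(trace_spans)))
```

All three Spans share one Trace ID, and the parent references are what allow a tool to draw the nested timeline shown in Figure 1.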
Logs are essentially text fragments produced by an application or system, each with a timestamp of when it was generated. It is as simple as that, but OpenTelemetry adds the SpanContext, which combines the Span ID and the Trace ID, and this allows the correlation between Logs and Traces.
The correlation between Logs and Traces is really important because it allows you to use the Trace ID to filter the logs produced for a specific request. In systems with a lot of activity, this can save a lot of time when trying to understand the influence that specific parameters have on the request duration, and when identifying what can lead to a failure deeper inside your system.
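The filtering described above can be sketched with Python's standard `logging` module alone. This is an illustration under assumptions, not how the OpenTelemetry SDK wires things: the `current_span_context` variable, the `SpanContextFilter` class, the `handle_request` function, and the ID values are hypothetical. In a real setup the IDs would come from the active Span.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active SpanContext: (trace_id, span_id).
current_span_context = contextvars.ContextVar("span_context", default=("-", "-"))


class SpanContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_span_context.get()
        return True


stream = io.StringIO()  # stands in for a log file or log backend
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(SpanContextFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(trace_id: str, span_id: str, cart_size: int) -> None:
    """Simulate one request; the IDs would normally come from the tracer."""
    token = current_span_context.set((trace_id, span_id))
    try:
        logger.info("processing cart with %d items", cart_size)
        if cart_size > 100:
            logger.error("cart too large")
    finally:
        current_span_context.reset(token)


handle_request("trace-aaa", "span-1", cart_size=3)
handle_request("trace-bbb", "span-2", cart_size=250)

# Filtering by Trace ID isolates the logs of a single request.
failing_request_logs = [
    line for line in stream.getvalue().splitlines() if "trace_id=trace-bbb" in line
]
print("\n".join(failing_request_logs))
```

Because every record carries the Trace ID, the two lines belonging to the failing request can be separated from the rest without guessing from timestamps.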
Metrics are the third type of signal, and they relate to the current state of some aspect of the system. In other words, a Metric is a picture of what the system looks like at some point in time. Despite this “static” nature, Metrics really shine when we put them in perspective and check how they change over time. A single CPU or memory utilization measurement is nice to have, but it has even more value if we analyze how it varies while some operation is running.
Since Metrics can represent the system state instead of an application state, the most common way to correlate Metrics with Logs and Traces is by using their timestamps. It makes sense because of the indirect relationship between a specific Trace and the effect on the system that produced it. However, there are cases where this relationship between a Metric and a Trace is direct. This can happen, for example, when we measure the time of a specific processing step of a given HTTP request where the step processing time varies depending on the parameters received in the related HTTP request. In this situation, the combination of a Metric with the corresponding SpanContext is called an Exemplar and we can have both the “individual” aspect from the Exemplar and the “aggregate” aspect from the Metric.
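The sketch below illustrates the Exemplar idea in plain Python. It is a simplified model, not the OpenTelemetry metrics SDK: the `Exemplar` and `DurationMetric` classes, the metric name, and the policy of keeping the slowest measurements are assumptions made for illustration, and real SDKs apply their own rules for choosing which measurements to keep.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exemplar:
    """One raw measurement plus the SpanContext that produced it."""
    value_ms: float
    trace_id: str
    span_id: str


@dataclass
class DurationMetric:
    """Hypothetical aggregate metric that keeps a few exemplars."""
    name: str
    count: int = 0
    total_ms: float = 0.0
    max_exemplars: int = 2
    exemplars: List[Exemplar] = field(default_factory=list)

    def record(self, value_ms: float, trace_id: str, span_id: str) -> None:
        # Aggregate view: only count and sum are kept for every measurement.
        self.count += 1
        self.total_ms += value_ms
        # Individual view: keep the slowest measurements as exemplars.
        self.exemplars.append(Exemplar(value_ms, trace_id, span_id))
        self.exemplars.sort(key=lambda e: e.value_ms, reverse=True)
        del self.exemplars[self.max_exemplars:]

    @property
    def mean_ms(self) -> float:
        return self.total_ms / self.count if self.count else 0.0


step_duration = DurationMetric("checkout.step.duration")
step_duration.record(12.0, trace_id="trace-aaa", span_id="span-1")
step_duration.record(480.0, trace_id="trace-bbb", span_id="span-2")
step_duration.record(18.0, trace_id="trace-ccc", span_id="span-3")

print(f"mean = {step_duration.mean_ms:.1f} ms over {step_duration.count} requests")
slowest = step_duration.exemplars[0]
print(f"slowest sample: {slowest.value_ms} ms, see trace {slowest.trace_id}")
```

The mean gives the aggregate picture, while the stored Trace ID points to one concrete request that explains an outlier.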
It’s possible to have a good understanding of a system with these three signals. So, it is time to discuss how to generate them using OpenTelemetry.
Instrumentation – The Mechanism to Generate Our Signals
OpenTelemetry supports automatic instrumentation, which generates signals without changes to the application code. However, the signals generated by automatic instrumentation may not be enough to describe all the behaviors and nuances of the application, so OpenTelemetry also provides APIs and SDKs that let developers integrate signal generation and propagation into their code. This adds flexibility and makes it possible to tailor signals with specific information that helps explain the internal workings of the application.
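As a rough illustration of what manual instrumentation adds, the following plain-Python sketch wraps a function so that each call records a span-like dictionary with custom attributes. Everything here is hypothetical: the `traced` decorator, the `finished_spans` list, the `apply_discount` function, and the attribute names. With the actual OpenTelemetry Python API, the comparable building blocks are `tracer.start_as_current_span(...)` and `span.set_attribute(...)`.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory "exporter": finished spans are appended here.
finished_spans: List[Dict[str, Any]] = []


def traced(name: str, **static_attributes: Any) -> Callable:
    """Decorator that records a span-like dict around each call."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            span = {"name": name, "attributes": dict(static_attributes), "status": "OK"}
            start = time.perf_counter()
            try:
                # The function may add its own attributes through `span`.
                return func(span, *args, **kwargs)
            except Exception as exc:
                span["status"] = "ERROR"
                span["attributes"]["exception.type"] = type(exc).__name__
                raise
            finally:
                span["duration_ms"] = (time.perf_counter() - start) * 1000
                finished_spans.append(span)

        return wrapper

    return decorator


@traced("apply-discount", component="pricing")
def apply_discount(span: Dict[str, Any], total: float, coupon: str) -> float:
    # Business-specific detail that automatic instrumentation cannot know.
    span["attributes"]["coupon.code"] = coupon
    if total < 0:
        raise ValueError("total must be non-negative")
    return round(total * 0.9, 2)


print(apply_discount(100.0, "WELCOME10"))
print(finished_spans[0]["attributes"])
```

The coupon code is the kind of application-specific detail that only code written by the developer can attach to a signal.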
OpenTelemetry is a set of tools that can reduce the effort needed to enhance the observability of our systems. Using it gives us a clear path for what we should observe in our applications and how to collect and propagate these observations:
- Traces should be used to describe our components' interactions.
- Logs should show the application's internal behaviors and should be enhanced with tracing context whenever possible.
- Metrics will describe our system state and should have the same time reference as the other signals for easier correlation.
- Instrumenting our applications for basic OpenTelemetry usage should be simple and achieved by using automatic instrumentation.
- Applications that require more should leverage manual instrumentation. It requires code changes but provides more flexibility.
This piece was written by Marcelo Mergulhão, an Innovation Expert at Encora’s Engineering Technology Practices group, and João Longo, an Innovation Leader at Encora’s Engineering Technology Practices group. Thanks to João Caleffi and André Scandaroli for reviews and insights.