APRIL 04, 2022

Monitoring and Observability

João Pedro São Gregório Silva

The way we build systems is changing, and traditional methods are starting to fail on modern architectures. A good example is debugging and diagnostics in microservices, where the simple act of attaching a debugger becomes a nightmare: not only is it hard to recreate the state of the system, but you also have to debug many processes at the same time. The need for change is undeniable, and new ideas and strategies for solving this problem are becoming popular and well defined.

In this article we are going to discuss monitoring and observability, two concepts that will make the process of solving the most annoying bugs a breeze (or at least a little easier).

Monitoring

Definition

Monitoring is the practice of measuring, consolidating, and presenting key metrics of an application that are actionable and relevant to the business. These key metrics can then be displayed in a dashboard and used to trigger alerts. Choosing which metrics to keep an eye on is usually tricky, especially early on, but you can start with the four golden signals and expand from there: latency (how long requests take), traffic (how much demand the system receives), errors (the rate of failed requests), and saturation (how full the system’s resources are).
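As a rough illustration, the four golden signals can be derived from a window of request records. This is a minimal sketch, not a production collector: the `Request` type, the `golden_signals` function, the choice of the 95th percentile for latency, and the use of a CPU busy fraction as the saturation signal are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned to the caller


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (0.0 if empty)."""
    if not sorted_values:
        return 0.0
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def golden_signals(requests, window_seconds, cpu_busy_fraction):
    """Summarize one observation window into the four golden signals."""
    durations = sorted(r.duration_ms for r in requests)
    failures = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests) if requests else 0.0,
        # Saturation is supplied by the caller; here it is assumed to be CPU.
        "saturation": cpu_busy_fraction,
    }
```

A dashboard or alerting rule would then read these four values per window instead of the raw request stream.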

“Monitoring everything” is the anti-pattern here, so the challenge is not only to implement monitoring, but also to know what to track.

What it supports or enables

Monitoring a system keeps developers and other stakeholders aware of its state. Among other things, this allows them to:

Track resource usage and react with scaling techniques.
Measure the impact of changes: the development team can see, for example, how the behavior of the user base changed after a change in the UI.
Check the system’s health and performance during deployment to help decide whether to trigger a rollback.

Possible challenges

Deciding what metrics should be monitored can be difficult. The four golden signals are your friends, but they should be seen as the bare minimum.
Visualization and alerting should be clean and to the point. Over-alerting can desensitize users.
The availability of the metrics and alerting system needs to be better than the availability of the system being monitored.

When to use

If you do not monitor a system, you don’t know if it is healthy or if it is behaving as expected. Monitoring is essential for achieving high availability and scalability, and it is a prerequisite for some zero-downtime deployment techniques.

It’s never too late to start monitoring your system, but the earlier you start, the better. Capturing infrastructure metrics like CPU, memory, and disk usage in your development and pre-production environments will help you discover bottlenecks or resource leaks early in your development cycle. It can also help you catch and fix bugs faster by having easy access to metrics like error rates, execution time, and the health of a component.

You can also use monitoring as a quality gate during the software development life cycle. For instance, you can implement a policy that prevents the promotion of an artifact that lowers the performance of the system. During a blue-green or canary deployment, monitoring can provide the information needed to decide whether the newly deployed version is healthy and ready to receive traffic or needs to be rolled back.
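A quality gate of this kind can be reduced to a comparison between the metrics of the current version and those of the candidate. The sketch below is only one possible shape for it: the function name `canary_decision`, the metric keys, and the tolerance values are hypothetical and would need to be tuned for a real system.

```python
def canary_decision(baseline, canary,
                    max_error_increase=0.01, max_latency_ratio=1.2):
    """Compare canary metrics with the baseline and decide what to do.

    Returns a (decision, reasons) tuple, where decision is either
    "promote" or "rollback" and reasons explains any rollback.
    """
    reasons = []
    if canary["error_rate"] > baseline["error_rate"] + max_error_increase:
        reasons.append("error rate rose beyond the allowed increase")
    if canary["latency_p95_ms"] > baseline["latency_p95_ms"] * max_latency_ratio:
        reasons.append("p95 latency rose beyond the allowed ratio")
    decision = "rollback" if reasons else "promote"
    return decision, reasons
```

Returning the reasons along with the decision keeps the gate explainable, which matters when a pipeline blocks a release automatically.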

Alerts fire when a metric crosses a pre-defined threshold and send a notification to interested parties, such as system operators or developers. You can even put in place advanced mechanisms like a bot that attempts to recover the system. Make sure every alert has a clear message that states the problem and the urgency level; otherwise, you can desensitize or confuse the recipient of the notification.
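To make the idea concrete, here is a minimal sketch of a threshold rule that produces a message stating both the problem and the urgency. The `AlertRule` and `Alert` types, the two urgency levels, and the message wording are assumptions for illustration, not the API of any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    metric: str      # name of the metric being watched
    warning: float   # value above which a warning is raised
    critical: float  # value above which the alert becomes critical
    unit: str = ""


@dataclass
class Alert:
    urgency: str
    message: str


def evaluate(rule: AlertRule, value: float) -> Optional[Alert]:
    """Return an Alert if the value is above a threshold, otherwise None."""
    if value > rule.critical:
        urgency, threshold = "critical", rule.critical
    elif value > rule.warning:
        urgency, threshold = "warning", rule.warning
    else:
        return None
    message = (
        f"[{urgency.upper()}] {rule.metric} is {value}{rule.unit}, "
        f"above the {urgency} threshold of {threshold}{rule.unit}"
    )
    return Alert(urgency, message)
```

For example, `evaluate(AlertRule("disk_usage", 80, 95, "%"), 97)` yields a critical alert, while a value of 70 yields no alert at all.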

It is also important to be able to visualize metrics. Numbers in a table are hard for humans to process, but once they are plotted on a graph, trends, outliers, and anomalies become much easier to detect.

Adopting in a greenfield

When starting a project, it is important to decide on the metrics collection strategy (push vs. pull, which format to use) early on, so that the libraries and tools used to capture and publish metrics can be standardized and reused.
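The difference between push and pull can be sketched with a small in-memory registry. In the pull model, the application exposes its current values and a collector fetches them; in the push model, the application delivers samples to the collector itself. The `MetricsRegistry` class and the simplified `name value` text format below are illustrative assumptions, not the format of any specific metrics server.

```python
class MetricsRegistry:
    """Holds counters in memory and supports both collection styles."""

    def __init__(self):
        self._counters = {}

    def increment(self, name, amount=1):
        self._counters[name] = self._counters.get(name, 0) + amount

    def render(self):
        """Pull model: text that a collector would fetch on its own schedule."""
        return "\n".join(
            f"{name} {value}" for name, value in sorted(self._counters.items())
        )

    def push(self, send):
        """Push model: the application hands each sample to the collector.

        `send` is any callable taking (name, value), for example a function
        that writes to a socket.
        """
        for name, value in sorted(self._counters.items()):
            send(name, value)
```

Keeping both styles behind one registry means application code only calls `increment`, and the collection strategy can be changed later without touching it.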

It is also important to choose a robust metrics server that can handle the influx of data and that has high availability. If your metrics server fails often, you lose confidence in your metrics and alerts. Consider using your cloud provider’s solution or a SaaS service like New Relic or Datadog to avoid having to manage your own metrics solution.

Adopting in a brownfield

In a brownfield, the same considerations listed in the greenfield section apply. Additionally, since the system is already being operated, it might be easier to identify the most valuable metrics and alarms based on the history of the system’s operation. The more mature a system is, the better its failure modes can be understood, and it should be clear which parts of the system are more susceptible to failures. The implementation should start with the most impactful metrics on the least resilient components.

Observability

Definition

Unlike monitoring, observability’s main goal is to help analyze a specific problem after it has happened. It encompasses the capture of detailed logs, exception tracking, tracing, and other data that monitoring solutions don’t normally use. Systems can be built to ease observability, in the same sense that they can be built to ease debugging.

What it supports or enables

With observability, developers have enough data to better understand problems, avoiding the need for random fixes, which often fail to address the real issue and waste time and money.

Possible challenges

The cost of collecting logs and metrics usually increases with the scale of the system.
Be careful with log levels and overall log quality. If there’s too much noise or if logs can’t be correlated with an operation, they are not useful.
Be careful not to build an observability tool that is never used.
Instrumentation may affect how a system behaves and can be the source of bugs or outages.

When to use

If you can detect symptoms but have a hard time identifying the cause of a problem, then you would benefit from observability. By observing the behavior of a system, you can better understand it and make better decisions.

Observability is complementary to monitoring; one does not eliminate the need for the other. Monitoring is used to assess the overall health of a system and is usually limited to a key set of metrics and alerts. Observability, on the other hand, captures the internal state of a component and how it interacts with other components, so that its behavior can be examined.

This technique is useful on systems of any scale, but it excels on large, complex systems. In such a system, the flow of data and the interaction between components are usually more important than the behavior of a single component. Observing how these interactions occur is essential for understanding how the system is behaving.

Adopting in a greenfield

For greenfield projects, it is better to choose a metrics platform that supports features like application performance management, distributed tracing, log aggregation, powerful search, and visualization. The platform should also be fast and easy to use. All these features could be provided by different platforms, but that can become costly and hard to manage.

You should also design your components to be observable. That means adding instrumentation and probes that can be used to introspect their internal state. It is a lot easier to observe a system that was designed for it.
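One lightweight way to build instrumentation into a component is a decorator that records how often a function is called, how often it fails, and how long it takes, together with a probe that exposes those numbers. This is a sketch under assumed names (`instrumented`, `probe`, `STATS`, and the sample `parse_order` function are all hypothetical); real systems would usually rely on a tracing or metrics library instead.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function statistics, keyed by function name.
STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(func):
    """Record call count, error count, and cumulative duration of func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        entry = STATS[func.__name__]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            entry["errors"] += 1
            raise  # instrumentation must not swallow the original error
        finally:
            entry["calls"] += 1
            entry["total_seconds"] += time.perf_counter() - start
    return wrapper


def probe():
    """Return a snapshot of the internal state for an operator or endpoint."""
    return {name: dict(values) for name, values in STATS.items()}


@instrumented
def parse_order(raw):
    # Hypothetical business function used only to demonstrate the decorator.
    return int(raw)
```

Note that the decorator re-raises exceptions unchanged, so the instrumentation observes failures without altering the behavior of the component.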

Adopting in a brownfield

In a brownfield, it will be a lot harder to get information from components that were not designed to be observed. To get the maximum value, you should start by observing the interaction between the components of the system. Middleware, such as a service mesh, is an easy target for instrumentation: because it intercepts and reroutes requests, it can increase observability without changes to the components themselves.

Log aggregation and log searching should also be set up as soon as possible in a brownfield project, and doing so is usually as simple as in a greenfield project. Logs are especially helpful when they are structured and can be correlated.
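Structured, correlated logs can be sketched with the Python standard library alone: each record is emitted as JSON, and every record written while handling the same operation carries the same correlation ID. The field names, the `build_logger` and `handle_request` helpers, and the log messages are assumptions made for this example.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Correlation ID of the operation currently being handled ("-" if none).
correlation_id = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger):
    """Hypothetical operation: all of its log lines share one correlation ID."""
    token = correlation_id.set(uuid.uuid4().hex)
    try:
        logger.info("order received")
        logger.info("payment authorized")
    finally:
        correlation_id.reset(token)
```

Once logs look like this, a log search tool can filter on `correlation_id` to reconstruct everything that happened during a single operation.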

Conclusion

Monitoring and observability are two sides of the same coin, and if you want to develop scalable software solutions that are easy to maintain, both are necessary.

Even if you are working with a system where scalability is not a concern, observability will still play an important role in helping you track down issues and deal with auditing.

Key Takeaways

The earlier, the better. If you are starting a new project, make sure that these techniques are part of your plans.
Take your time to identify which metrics you should track, but keep in mind that the list will probably change over time.
The availability of the metrics and alerting system needs to be one of your top priorities. If they fail, you will be left blind to your system’s performance.
The volume of logs can grow faster than you expect. Try limiting the overall log collection and purging unnecessary logs once in a while.

Acknowledgement

This article was written by João Pedro São Gregório Silva, Software Developer, and co-authored by Isac Sacchi Souza, Principal DevOps Specialist, Systems Architect & member of the DevOps Technology Practice. Thanks to João Augusto Caleffi and the DevOps Technology Practice for reviews and insights.
