In a previous article, I talked about a couple of steps you could take to reduce the “noise” that your monitoring system may be generating. One of the important things mentioned there is to literally monitor all the things. That advice still holds true: I still believe in it, and I still monitor everything I possibly can, because we do not know when and where systems will break.
Reducing noise, however, has become so important that in 2015, during the second Checkmk conference, I had a small but important thought which I noted down. I am not sure where, or even if, I had heard the term “minimal viable metrics” before, but then and there I sent myself an email to file in my “someday/maybe” folder with the following:
are we monitoring too much? can we infer state and performance by only looking at a subset of the available metrics? upside would be less noise, downsides could be we miss things we do not know we do not know
That message sparked a couple of debates with my monitoring buddies, professional acquaintances, partners and customers. We are all genuinely interested in improving the signal-to-noise ratio; we want to sleep well at night, we want to make sure we are doing things objectively right, and so on. But the simple fact is that things are not so easy anymore: we are breaking down monoliths and pushing out microservices, distributed by their very nature and made even more complex by the multi-cloud nature of our current environments.
So what do you do now, when your typical application exposes hundreds to thousands of metrics that may or may not be important?
My first response is still to monitor all the things. I may, however, now add that we should not be alerting or sending notifications for all the things. What I mean is that with those hundreds of metrics, if you are going to do threshold-based monitoring, you are inevitably going to get it wrong and generate false alarms.
Our solution is to monitor everything as before and make sure the data is in the system, in case we decide, or later identify, that certain metrics correlate and together can help us infer state or performance problems. What this boils down to is a more thorough configuration of your metrics and possibly using some form of aggregation (Checkmk’s BI for us, or RRDtool in some cases) to alert on certain conditions, as in the sketch below.
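To make the idea concrete, here is a minimal sketch of alerting on an aggregate condition rather than on individual thresholds. The metric names and thresholds are made-up examples, not anything Checkmk-specific; a single metric crossing its threshold is merely noted, while correlated degradation is what actually raises a critical state.

```python
# Hypothetical sketch: alert on an aggregate condition instead of
# per-metric thresholds. Names and thresholds are invented for illustration.

# Latest samples for a hypothetical web service, collected by your monitoring system
metrics = {
    "request_latency_ms": 850,
    "error_rate_pct": 2.5,
    "queue_depth": 120,
    "cpu_util_pct": 78,
}

def infer_state(m: dict) -> str:
    """Infer service state from a small set of correlated metrics.

    A single metric crossing its threshold only produces WARN; CRIT fires
    only when latency, errors and queue depth degrade together.
    """
    degraded = [
        m["request_latency_ms"] > 500,
        m["error_rate_pct"] > 1.0,
        m["queue_depth"] > 100,
    ]
    if all(degraded):
        return "CRIT"   # correlated degradation: page someone
    if any(degraded):
        return "WARN"   # single outlier: record it, do not notify
    return "OK"

print(infer_state(metrics))  # -> CRIT for the sample values above
```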
MVM, in our case, is basically identifying the absolute minimum number of metrics that provide insight and visibility into our systems and applications and deliver timely, actionable alerts and notifications. MVM is very subjective and still an early idea, and as such it requires thinking about the problem up front.
Implementing successful MVM monitoring is a process of trial and error, especially today, with metrics available from every corner and piece of software and hardware running in your infrastructure. What is truly required is the patience to understand how they correlate (if they correlate at all) and to avoid alerting on every little metric that strays outside its thresholds or observability model. Using the three-step process detailed before is going to help you here.
Once you are looking at your hundreds or thousands of metrics, you will need to decide which ones are important. If you already know, that is just fine, but oftentimes you won’t, so you will use the scientific method: make a guess, compute the consequences of your guess, and compare the results to observation. Luckily, Checkmk makes this comparison easy: you have the data right there in your dashboards and views.
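One way to run that guess-compute-compare loop outside the dashboards is to export your metric history and check which metrics actually move with real incidents. The sketch below assumes you have such an export in a CSV (a hypothetical metrics.csv with one column per metric and an "incident" column marking the periods when the application was genuinely degraded); the 0.6 cut-off is an arbitrary starting point, not a recommendation.

```python
# Sketch of the "guess, compute, compare" loop over exported metric history.
import pandas as pd

df = pd.read_csv("metrics.csv")              # hypothetical export of your metric history
candidates = df.drop(columns=["incident"])   # everything except the incident indicator

# Guess: metrics that move with real incidents carry signal.
# Compute: correlation of each metric with the incident indicator.
correlation = candidates.corrwith(df["incident"]).abs().sort_values(ascending=False)

# Compare: keep the handful of metrics that best track observed problems.
keep = correlation[correlation > 0.6].index.tolist()   # arbitrary cut-off for illustration
print("Minimal viable metrics candidates:", keep)
```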
Iterate a couple of times through your metrics and find the ones that give you the most value (actionable information), and once you are happy, feel free to disable notifications for the rest, as sketched below.
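In Checkmk we do this with notification rules, but the principle is tool-agnostic: keep a curated list of actionable metrics and let only those page anyone. A minimal, hypothetical filter might look like this.

```python
# Hypothetical notification filter: only metrics on the curated "keep" list
# are allowed to notify; everything else is still collected and graphed.
ACTIONABLE = {"request_latency_ms", "error_rate_pct", "queue_depth"}

def should_notify(metric_name: str, state: str) -> bool:
    """Suppress notifications for metrics we decided are not actionable."""
    return state in {"WARN", "CRIT"} and metric_name in ACTIONABLE

print(should_notify("cpu_util_pct", "CRIT"))     # False: collected, not alerted
print(should_notify("error_rate_pct", "CRIT"))   # True: actionable metric
```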
MVM has been helping us monitor ever more complex systems without drowning in useless alerts. The key is to use a method that is as fail-proof as possible.
We use Checkmk for central monitoring because of its many features: the rule-based configuration allows us to elegantly describe our infrastructure and raise alerts when things deviate from our desired state; the notification engine allows us to easily compose and send notifications from those alerts; and finally, Business Intelligence allows us to further refine and alert or notify on aggregated data. Integrations with Prometheus and other time-series tooling have allowed us to pull in even more metrics, but it always boils down to simplifying the problem and reducing the metrics to a short, relevant list from which we can infer state. The rest are there just to fill up space on our already limited screens.
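As an illustration of that "one aggregated number instead of dozens of raw counters" idea, here is a small sketch against Prometheus' HTTP query API (the /api/v1/query endpoint). The host name and the PromQL expression are assumptions for the example, not part of our actual setup.

```python
# Minimal sketch: fetch one aggregated value from Prometheus instead of
# tracking every raw per-instance counter separately.
import requests

PROMETHEUS = "http://prometheus.example.com:9090"          # assumed host for illustration
query = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'  # hypothetical error-rate query

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

# One number for the whole fleet, suitable for a single alert condition.
error_rate = float(result[0]["value"][1]) if result else 0.0
print(f"5xx request rate across the fleet: {error_rate:.2f}/s")
```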
If you feel your systems are noisy, then take a look at what is generating the noise and take small, incremental steps to reduce it. There is no magic bullet, no magic software that automatically understands your systems and their state (although many will sell you exactly that promise, usually with some AI or ML baked in).
If you would like more information about our practical approach to monitoring please reach out and let us know.