A blog about Monitoring and Observability

Over the last few days, I have been thinking about a bad day I had at work a year (or more?) ago.

=> Bad, bad software.

Suffice it to say, things haven't improved. As somebody with experience in IT observability, both as a programmer and as a consultant, I find this is starting to drive me crazy.

The state of infrastructure monitoring and observability in the enterprise world is so bad that I've decided to start a new blog dedicated to the topic, which will come online soon.

To be fair, it will not be a classic blog but something closer to an online course. We will talk about both infrastructure monitoring (servers, switches, routers, etc.) and application monitoring, from a theoretical point of view as well as through tutorials, using Zabbix as the monitoring solution.

=> The Zabbix monitoring system.

We will discuss the topic at length because many system administrators, more often than not, end up with observability implementations that become a nuisance instead of helping them catch what's important. In fact, way too often, IT professionals are crushed by the sheer number of alerts they receive on a daily basis.

This is generally due to a wrong understanding of what monitoring and observability really mean and, in the end, to a wrong configuration of the chosen monitoring system.

More often than not, IT teams have issues because they monitor the wrong set of "things" for the wrong amount of time, or because they think that monitoring everything under the Sun is the right path forward. Either way, they add to the noise with an avalanche of alerts that lands on the poor on-call souls.

Infrastructure and application monitoring does not live in a vacuum: it must be combined with ticketing systems, SLA policies and root cause analysis to get the team's observability story just right.

We will talk about the magic world of automatic remediation, that is, the ability to leverage your monitoring system's features to try and solve issues automatically. Many techniques can be used to remediate issues automatically, and we will go through some simple examples just to get the idea behind them. However, the main focus will be on leveraging the available tricks to reduce the number of alerts a team gets, while also increasing the availability of the monitored systems.
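
To give a rough idea of the principle, here is a minimal sketch in plain Python. It is not Zabbix configuration, and every name in it (handle_problem, is_running, restart, page) is hypothetical: the point is only that a remediation attempt comes first, and a human gets alerted only if it fails.

```python
def handle_problem(check, remediate, alert, max_attempts=2):
    """Try to fix a failing check before alerting anyone.

    check, remediate and alert are caller-supplied callables
    (hypothetical names, chosen for this illustration only).
    """
    if check():
        return "ok"
    for _ in range(max_attempts):
        remediate()
        if check():
            # The fix worked: nobody needs to be woken up.
            return "remediated"
    # Remediation failed: now a human really has to look at it.
    alert()
    return "alerted"


# Simulated service, standing in for a real daemon.
state = {"running": False, "alerts": 0}

def is_running():
    return state["running"]

def restart():
    state["running"] = True

def page():
    state["alerts"] += 1

result = handle_problem(is_running, restart, page)
```

In this simulation the restart succeeds, so no alert is raised. In a real monitoring system, the same logic would typically be expressed through the system's own alerting and remote command features rather than through a script like this one.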

Later on, we will describe how to make applications automatically send their performance metrics to a Zabbix system, by integrating its (rather simple) communications protocol into our example programs.
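
As a small taste of how simple that protocol is, here is a sketch that only builds a sender packet, without opening any network connection. It assumes the commonly documented layout (a "ZBXD" signature, a flags byte set to 1, a little-endian payload length, then a JSON body with a "sender data" request). The host name and item key are made up for the example, and the function name build_sender_packet is mine.

```python
import json
import struct


def build_sender_packet(host, key, value):
    """Build one Zabbix sender packet as bytes (sketch, nothing is sent)."""
    payload = json.dumps({
        "request": "sender data",
        "data": [{"host": host, "key": key, "value": str(value)}],
    }).encode("utf-8")
    # Header: 4-byte signature, 1 flags byte, 8 bytes of little-endian length
    # (4-byte length plus 4 reserved bytes, which is the same thing for small payloads).
    header = struct.pack("<4sBQ", b"ZBXD", 1, len(payload))
    return header + payload


# Hypothetical host and item key, for illustration only.
packet = build_sender_packet("app-server-01", "app.requests.per_second", 42)
```

To actually deliver the packet, a program would write these bytes to a TCP connection towards the Zabbix server or proxy trapper port (10051 by default) and read back the JSON response. The host and item key must match a trapper item configured on the Zabbix side.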

Plus much, much more ;).

I can't promise how long the series will be, or how much time it will take me to complete it. One thing is for sure: there will be an RSS feed for the new blog, and each new chapter will be posted as soon as it is ready. There will be no big-bang publication, since I don't intend this to be a book. Expect new material to be published from time to time.

I'll write a small update on this same blog once the first article is ready, along with the various links.

Stay tuned; I hope you will find the published material useful.