Back when Hostrangers was still quite a small company, Nagios, Cacti and Ganglia were the only open source monitoring tools on the market. They are less well known now, but Nagios and Cacti are still under development even today.
Even though no automation tools existed, Bash + Perl did the job. If you want to scale your team and yourself, automation should never be ignored: no automation means more manual human work.
We started out with around 150 physical servers. For comparison, today we have around 2000 servers, including internal-purpose VMs and physical boxes.
All of the aforementioned tools (Nagios, Cacti, and Ganglia) relied mostly on SNMP, which in my opinion is horrible.
For networking gear, SNMP is still widely used, but with white-box switches it is gradually becoming unnecessary.
Instead, run _node_exporter_ or any other exporter on the switch and expose whatever you need in a human-readable format. Beautiful is better than ugly, right?
We use CumulusOS, which in our case runs mostly on x86 hardware, so running any kind of Linux software on the switches is absolutely not a problem.
In 2015, when we started automating everything that could be automated, we introduced the Prometheus ecosystem. In the beginning, we had a single monitoring box running Alertmanager, Pushgateway, Grafana, Graylog, and rsyslogd.
During the transition period from the old monitoring stack (NCG: Nagios/Cacti/Ganglia) we ran both systems side by side. Now we rely only on Prometheus.
The new setup improved our metric resolution from 5 minutes to 15 seconds, which allows fine-grained, deep analysis. MTTD (Mean Time To Detect) also dropped by a factor of 4.
We have about 25 community metric exporters at our disposal, plus some custom-written ones like _lxc_exporter_. Mostly we expose custom business-related metrics using the textfile collector.
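To make the textfile collector idea concrete, here is a minimal sketch of how a business metric could be written to a `.prom` file that node_exporter picks up from the directory passed via `--collector.textfile.directory`. The metric name `hosting_active_clients`, the label, and the file name are hypothetical and only for illustration; this is not our production script.

```python
import os
import tempfile


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are
    assumed to contain no quotes, backslashes, or newlines (no escaping here).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def write_prom_file(directory, filename, content):
    """Write via a temp file and rename, so a half-written file is never scraped."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file as 0600; make it readable for the exporter user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, os.path.join(directory, filename))
```

A cron job could call `render_metric("hosting_active_clients", "Active clients per plan.", "gauge", [({"plan": "premium"}, 42)])` and pass the result to `write_prom_file`. The rename step matters because the collector reads whatever is in the directory at scrape time.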
We also evaluated the TICK (Telegraf/InfluxDB/Chronograf/Kapacitor) stack, but we were not happy with it because of its limited functionality at that time. Prometheus looked simpler in many ways and more mature to implement.
Later, in 2017, we started using PagerDuty for paging. We have a weekly 24/7 on-call rotation, and each of us is on call only every fifth or sixth week, which is quite comfortable. But we are looking forward to eliminating this duty altogether. Instead, we will take care of the services we own, because the owner knows the problems best.
Last year, after our infrastructure had grown N times since 2015, Prometheus and Alertmanager became the main bottleneck. Our Prometheus eats about 2TB of disk space, so whenever we restart it or push changes to the node using knife, we miss monitoring data for a while. A Prometheus restart takes about 10-15 minutes. Not acceptable.
Another problem is that if a single location goes down, we miss monitoring data as well. Thus we decided to implement a highly available monitoring infrastructure: two Prometheus nodes and two Alertmanagers on separate continents.
Our main visualization tool is Grafana. It is critically important that Grafana can query the backup Prometheus node if the primary is down. It is as easy as that: put an HAProxy in front and accept connections locally.
    backend prometheus
        server prometheus_2a02_4780_9__1234 2a02:4780:9::1234:9090 check fall 3 rise 2
        server prometheus_2a02_4780_bad_c0de__1234 2a02:4780:bad:c0de::1234:9090 check fall 3 rise 2 backup
        option httpchk GET /graph
        http-check expect rstatus (2|3)[0-9][0-9]
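For readers less familiar with HAProxy, the failover behaviour of that config can be modelled roughly as follows. This is a simplified sketch of the `fall 3 rise 2` and `backup` semantics, not HAProxy's actual code: a server is marked down after three consecutive failed checks, marked up again after two consecutive successful ones, and the backup only receives traffic when no regular server is up. The class and function names are made up for the example.

```python
import re

# Same pattern as `http-check expect rstatus` in the config above.
HEALTHY_STATUS = re.compile(r"(2|3)[0-9][0-9]")


class Server:
    def __init__(self, name, backup=False, fall=3, rise=2):
        self.name = name
        self.backup = backup
        self.fall = fall
        self.rise = rise
        self.up = True
        self.streak = 0  # consecutive checks that contradict the current state

    def record_check(self, status_code):
        """Feed one health-check result (an HTTP status code) into the state."""
        ok = HEALTHY_STATUS.fullmatch(str(status_code)) is not None
        if ok == self.up:
            self.streak = 0  # result agrees with current state, reset the streak
            return
        self.streak += 1
        threshold = self.rise if ok else self.fall
        if self.streak >= threshold:
            self.up = ok
            self.streak = 0


def pick_server(servers):
    """Prefer a healthy regular server; fall back to a healthy backup."""
    for want_backup in (False, True):
        for server in servers:
            if server.up and server.backup == want_backup:
                return server
    return None
```

With a `primary` and a `backup` server, feeding three 503 responses into `primary.record_check` makes `pick_server` return the backup, and two subsequent 200 responses switch traffic back to the primary.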
Worried about the performance cost of the extra middleware? We are talking about single-digit milliseconds, so all good.
Another problem is how to prevent users (developers and other internal staff) from abusing dashboards and overloading the Prometheus nodes, or the backup node if the primary one is down: the thundering herd problem.
To achieve the desired state, we gave Trickster a chance. It speeds up dashboard loading times incredibly by caching time series. In our case the cache sits in memory, but there are other options for where to store it. Even when the primary node goes down and you refresh the dashboard, Trickster will not query the second node for time series it already has cached in memory. Trickster sits between Grafana and Prometheus and simply talks to the Prometheus API.
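The effect of such a cache can be illustrated with a deliberately naive sketch. This is not how Trickster is implemented (it does considerably more than this); it only shows why a repeated dashboard refresh does not have to reach Prometheus at all. `QueryCache`, its `fetch` callback, and the TTL value are assumptions made for the example.

```python
import time


class QueryCache:
    """In-memory cache in front of a range-query backend (illustrative only)."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable(query, start, end, step) -> result
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # key -> (stored_at, result)

    def get(self, query, start, end, step):
        key = (query, start, end, step)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # served from memory, backend is not touched
        result = self._fetch(query, start, end, step)
        self._entries[key] = (now, result)
        return result
```

If ten people open the same dashboard within the TTL, the backend sees one query instead of ten, which is what takes the pressure off the backup node during a failover.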
Prometheus nodes are independent, while Alertmanager nodes form a cluster. If both Alertmanagers see the same alert, they deduplicate it and notify once instead of multiple times.
We have plans to run plenty of _blackbox_exporters_ and monitor every Hostrangers client’s website because anything that cannot be monitored cannot be accessed.