This is a quick survey of available tools to monitor a cloud-based (AWS or multi-cloud) infrastructure network. My conclusions and recommendations are at the bottom.
The network monitoring tools market can be broken up into several segments:
- On-premise systems (office- or datacenter-oriented) dedicated to network monitoring; they cover servers as well, but with a particular focus on routing/switching hardware and SNMP.
- As-a-service systems offering comprehensive monitoring on a server (instance) level, with network monitoring included as one of the many supported checks.
- Self-hosted or incomplete systems dedicated to network monitoring on a server (instance) level.
Ideally, I would find a system that gathers its data in a configurable fashion, mainly from individual servers, but can also understand AWS flow logs and combine these two sources of data. I do not want that system to do anything besides network visualization and troubleshooting, since the infrastructure I’m working on already has tools for everything else. I would also like the system to push some metrics to a monitoring solution (using statsd) and work well with PagerDuty for managing any detected incidents. The system shouldn’t do “any-to-any” checks (at least not often), since at a scale of 1,000–10,000 machines these can generate non-trivial traffic.
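To put the “any-to-any” concern in numbers, here is a quick back-of-the-envelope sketch (the 10-second probe interval is an arbitrary assumption):

```python
def full_mesh_probes(hosts: int, interval_s: float = 10.0) -> float:
    """Probes per second for an any-to-any mesh: every directed pair
    of hosts checked once per interval."""
    return hosts * (hosts - 1) / interval_s

# Mesh traffic grows as O(n^2): going from 1,000 to 10,000 machines
# multiplies the probe volume by roughly 100.
print(f"{full_mesh_probes(1_000):,.0f} probes/s")   # 99,900
print(f"{full_mesh_probes(10_000):,.0f} probes/s")  # 9,999,000
```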
Offerings by segment
Systems dedicated to network monitoring
This segment contains tools that evolved from LAN monitoring and are designed primarily to manage and monitor a local network, mostly via configuration of the networking hardware (routers, switches, etc.). Typical traits are Windows-only software, a heavy emphasis on SNMP as the main protocol for data collection, and compatibility with all the different network hardware vendors. These tools are usually “priced for the enterprise” as well. None of them are particularly suitable for monitoring an in-the-cloud network, where there are no switches and routers, and the only information comes from individual machines and flow logs.
- PRTG is a representative solution checking all of the boxes (Windows-only, SNMP, traffic sniffing via compatible hardware, etc.).
- SolarWinds NetFlow, which relies on compatible hardware to generate flow data.
- LogicMonitor, with its long list of supported hardware, which firmly places it in this segment.
- AppNeta’s marketing website says some of the right things, but it’s difficult to see how exactly it would work in AWS (custom hardware seems to be suggested, and VMware images are the only other option), and the blinking “new message” from a “sales agent” makes me want to spend as little time on their website as possible.
Comprehensive monitoring systems with network checks
This segment contains full-stack monitoring systems, both hosted and as-a-service, that provide network checks as part of the package. Most systems here are listed without going into detail, because I do not want to adopt yet another full-stack monitoring system - one is plenty.
- Datadog: has per-instance network metrics (not broken down further by destination or port) and a TCP check built into the agent. To create a TCP destination for the check, I can get a simple echo server up on all machines using, for example, xinetd. This seems to be the easiest way to get something going.
- netdata + prometheus: has a fping plugin for network checks.
- sensu network checks.
- sysdig network checks.
- Monitoring plugins that can be integrated with Icinga, Naemon, Sensu, Nagios, Shinken, and so on and so forth.
- All kinds of CloudWatch-based solutions, like CloudPing (with Lambda), since CloudWatch does not have native functionality of this sort.
- Zabbix, etc.
This category also includes comprehensive security systems that have some network monitoring capability as well, like ThreatStack. All these tools are similarly out of scope – the only thing worse than running multiple monitoring tools on a single box is running multiple security tools on a single box.
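For reference, the echo-server target for the Datadog TCP check doesn’t strictly need xinetd; a minimal sketch in Python would do the same job (the port number here is an arbitrary choice):

```python
import socketserver

class EchoHandler(socketserver.BaseRequestHandler):
    """Echo back whatever the client sends until it disconnects."""
    def handle(self):
        while True:
            data = self.request.recv(4096)
            if not data:  # client closed the connection
                break
            self.request.sendall(data)

def make_echo_server(host: str = "0.0.0.0", port: int = 7777):
    """Build the server; 7777 is arbitrary -- point the TCP check at
    whatever port is chosen here."""
    # ThreadingTCPServer so one slow probe doesn't block the others
    return socketserver.ThreadingTCPServer((host, port), EchoHandler)

# On each box: make_echo_server().serve_forever()
```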
Self-hosted, dedicated server monitoring
- SmokePing: a single-host system that also supports a master/agents configuration. It is built around probes (explicit checks); many different probes are supported, including the standard ICMP ping and TCP. The implementation is Perl and the web interface runs off CGI scripts. Check data is stored in the filesystem in RRD format, and the system design (2002, last updated 2014, with the major 3.0 rewrite apparently stalled in 2013) predates most modern thinking about high availability and operating at large scale (thousands of hosts, not the ten or so in SmokePing demos). I think this tool is great for a small infrastructure (tens of machines), but it will run into performance issues at our scale – and it will be a snowflake setup, since backup, restore, and high-availability processes are nonexistent.
- Arachne: an agent for performing TCP-based latency and reachability checks against other agents (with a custom protocol) or external endpoints. It can be configured locally, or it can regularly pull configuration from a master (called an “orchestrator”), but no code for the orchestrator is provided. The tool outputs its results as statsd-formatted metrics, to be fed into another monitoring tool or service. I consider this an “incomplete” system in the sense that it is one piece of a larger solution: the other necessary pieces are fairly standard and easily available, while some notable matching pieces (like the orchestrator) are missing completely. However, completing the puzzle does not appear to be too difficult, and a centralized orchestrator for configuration does not present a scaling problem at our scale (1,000 to 10,000 machines). It could also be a great testing ground for evaluating Honeycomb.
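For a sense of what “statsd-formatted metrics” means in practice: the wire format is just text over UDP. A minimal emitter looks like the following (the metric name is hypothetical; 8125 is the conventional statsd port):

```python
import socket

def emit_timing(metric: str, value_ms: float,
                host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Send a statsd timing metric over UDP and return the raw payload.

    The name:value|ms format and port 8125 are standard statsd
    conventions; Arachne's actual metric names may differ.
    """
    payload = f"{metric}:{value_ms:g}|ms".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))  # fire-and-forget
    return payload

# e.g. a probe's round-trip time, with a made-up metric name:
# emit_timing("network.probe.us_east_1a.rtt", 1.4)
```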
Conclusions and recommendations
Using whatever my monitoring system already provides (Datadog TCP checks against an echo server) seems to be the easiest choice for small groups of machines. Configuration management can be used to generate the configuration for this check.
Longer term, for whole-network monitoring, I could use the approach outlined by the Arachne presentation to define “layers” of reachability checks (AZ, region, WAN), and select a small, fixed subset of each layer for each machine. A proof of concept for this can be configured using CM, but I’d like to see a central orchestrator providing the list of nodes instead of each node doing the same global searches and figuring things out independently. The nice part of the orchestrator approach is being able to switch the implementation quickly and go from Datadog agent checks to, for example, Arachne; and to change selection logic independently of general system configuration management.
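The selection logic such an orchestrator (or, for the proof of concept, the CM templates) could use might be sketched like this – layer names and fan-out are illustrative, and a stable hash keeps each node’s target set deterministic across runs:

```python
import hashlib
from typing import Dict, List

def pick_targets(self_host: str, layers: Dict[str, List[str]],
                 fanout: int = 3) -> Dict[str, List[str]]:
    """For each reachability layer (e.g. same-AZ, same-region, WAN),
    pick a small fixed subset of peers for this host to probe.

    Selection is deterministic per (host, peer) pair, so the check
    configuration stays stable and can be generated centrally."""
    targets = {}
    for layer, hosts in layers.items():
        candidates = [h for h in hosts if h != self_host]
        # rank peers by a stable hash of the (self, peer) pair
        candidates.sort(key=lambda h: hashlib.sha256(
            f"{self_host}->{h}".encode()).hexdigest())
        targets[layer] = candidates[:fanout]
    return targets
```

Each machine then probes roughly fanout × layers peers – a handful – instead of the whole fleet, and swapping the check implementation or the selection logic touches only this one place.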
It is a bit surprising that I couldn’t find anything offering a complete solution built for the cloud from the ground up. Perhaps the community can point me at something I missed?