Infrastructure Monitoring – Part 2: Industry Leaders and Selection

Part 2 of this series provides an overview of the major vendors in the infrastructure monitoring space, helps you narrow down your options, and discusses the benefits of doing a Proof of Concept.

Part 1 – Introduction and Requirements
Part 2 – Industry Leaders and Selection
Part 3 – Effective Monitoring
Part 4 – Implementation and Discovery
Part 5 – Dashboards, Reports, and Access
Part 6 – Continuous Improvement

I have not personally used every vendor on this list in a production environment. I have heard industry feedback and consensus on all of them; some I work with on a daily basis, and some I have simply downloaded and tested. There are too many options in the industry to cover them all, so this list focuses on internal, self-hosted solutions.

The Vendor List (In Approximate Order of Cost)

Nagios Core is the free, open source “central” functionality behind many of the inexpensive/free monitoring options. The free product includes the core monitoring platform and basic event/alert functionality. Most configuration and interaction with the system is done via text configuration files, with most of the data viewable via a web interface. This option, while free, will definitely require a LOT of hands-on time during initial setup. Most “check” functionality is provided by community-maintained scripts/plugins that you will have to integrate into the check engine. If your monitoring requirements are not complex, are fairly rigid, and new system additions are infrequent, you may be able to get away with Nagios Core. You will need a moderate amount of Linux experience to be successful with this solution.
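To give a feel for what those community check plugins look like, here is a minimal sketch in Python. It assumes the conventional Nagios plugin contract: one line of status text on standard output, plus an exit code of 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. The function names and the 80/90 percent thresholds are my own illustrative choices, not taken from any shipped plugin.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to an (exit_code, message) pair.

    The warn/crit thresholds are illustrative defaults.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is reported as UNKNOWN, not OK.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    used_percent = usage.used / usage.total * 100
    code, message = evaluate_disk(used_percent)
    print(message)
    return code
```

Saved as a standalone script, the last step would be to pass the return value of `main()` to `sys.exit()` so the check engine sees the exit code. Wiring the script into the text configuration is the hands-on part that varies by installation.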

Icinga, like several other entries, took the OSS foundation of Nagios Core and overlaid more full-featured monitoring functionality. They have since rewritten the core in their 2.x release, effectively leaving behind their direct connection to Nagios. Those familiar with Nagios products will definitely see terminology carryovers such as host groups, flap detection, and host/service checks. The Web 2.0 interface is an obvious improvement over Nagios Core, as is scalability: Icinga supports a distributed monitoring architecture out of the box. Icinga is free, but funds its development via paid support contracts.

Zabbix has been around since its initial 1.0 release in 2004, is Open Source, and monetizes itself via the traditional “free but pay for support” method. Zabbix touts its scalability and flexible permission options along with wide agent-less operating system monitoring compatibility. Unlike some of the Nagios-based products, Zabbix has many feature-rich checks built in, including user emulation (such as measuring response times during a log in and log out cycle).

Nagios XI is the paid and heavily extended version of Nagios Core, maintained by the Nagios Core developers. Nagios XI takes you out of the realm of text configuration and into the web interface, where you do most of your day-to-day work. You’ll still need a decent amount of Linux experience for some of the less common tasks (such as installing a new plugin for new monitoring functionality, and troubleshooting). Even so, the web interface is a huge improvement over Nagios Core, especially for graphical visualizations. Many of the same monitoring capabilities are provided via the same community-driven Nagios Exchange. Nagios XI functions as a great introduction to “Enterprise Monitoring” with features including bulk import, auto discovery, configuration wizards, authentication integration, and reporting.

Paessler Router Traffic Grapher (or PRTG) positions itself as a competitor to the Nagios XI, Icinga, and Zabbix solutions. Their transparent pricing allows for a free full-feature deployment of up to 100 sensors (Note: a sensor is a specific metric on a specific host). One unique feature of PRTG is that it supports both a web interface and a native Windows “thick client” for managing/viewing your sensors. It also supports PRTG server clustering for failover or for redundant monitoring of the same service from multiple locations.
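Because licensing is counted per sensor rather than per host, it helps to estimate your sensor count before committing to a tier. The sketch below is plain arithmetic, not a PRTG API: the host names and metric lists are hypothetical, and the 100-sensor limit is simply the figure quoted above.

```python
# Free-tier limit as quoted in this article; check current pricing before relying on it.
FREE_SENSOR_LIMIT = 100


def count_sensors(inventory):
    """Count sensors, where one sensor = one metric on one host.

    `inventory` maps a host name to the list of metrics watched on that host.
    """
    return sum(len(metrics) for metrics in inventory.values())


# Hypothetical inventory used purely for illustration.
inventory = {
    "app01": ["ping", "cpu", "memory", "disk", "http"],
    "db01": ["ping", "cpu", "memory", "disk", "sql_port"],
    "core-switch": ["ping", "uplink_traffic"],
}

total = count_sensors(inventory)
fits_free_tier = total <= FREE_SENSOR_LIMIT
print(f"{total} sensors needed; fits free tier: {fits_free_tier}")
```

Even a small inventory like this one uses a dozen sensors for three hosts, so the count tends to grow much faster than the number of servers.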

Opsview is another monitoring solution whose roots lie with Nagios Core, going back to 2003. Opsview is probably the most feature-rich of the Nagios-based monitoring solutions. They have fair penetration into the enterprise space as well as flexible licensing options for <25 Hosts (Free!), <300 Hosts, and Unlimited Hosts. Opsview has placed itself in an excellent sweet spot for small businesses that have outgrown heavy hands-on monitoring (Nagios, custom scripts) but do not have the need or funds to implement any of the “big iron” monitoring suites.

SolarWinds Orion brings together a suite of monitoring products, the most relevant of which are Network Performance Monitor (NPM) and Server & Application Monitor (SAM). Their other product offerings dive deeper into various areas of IT monitoring including Virtualization, Storage, Application/Database Performance, Patch Management, etc. Most of their tools have varying levels of integration with their central web interface, with NPM and SAM being heavily intertwined. This tool has an intuitive web interface, though certain aspects can be challenging to set up without experience. The biggest complaint I see around SolarWinds is that they’re acquiring more functionality than they can integrate into their core interface. Some of their add-ons “exchange” data with the central interface, but also have their own more feature-rich interface. This is a tool you’ll definitely want time to experiment with before you import every server for monitoring. It is by far the most feature-rich of the options on this page and is likely also the most expensive.

Honorable Mentions

NetCrunch – NetCrunch combines the monitoring of network infrastructure devices like: switches, routers and printers with the monitoring of servers, applications and virtualization hosts.
ScienceLogic – Complete Hybrid IT Monitoring – Complete monitoring for power, network, storage, servers, applications, and the public cloud.
NewRelic SERVERS – Server monitoring from the app perspective – See how apps perform in the context of your server health.
ManageEngine – Application Performance Monitoring across physical, virtual and cloud environments.
Observium – A low-maintenance auto-discovering network monitoring platform supporting a wide range of device types, platforms and operating systems.
SevOne – The patented SevOne Cluster architecture leverages distributed computing to scale infinitely and collect millions of metrics, flows, and logs while providing real-time reporting down to the second.
CA Unified Infrastructure Management – A single, scalable platform for monitoring servers, applications, networks, databases, storage devices and even the customer experience.
op5 Monitor – Monitor every server, from the cloud to the basement. If you are in need of control, we have the solution.
Zenoss – An award-winning open source IT monitoring product that offers visibility over the entire IT stack, from network devices to applications.
Check_MK – Comprehensive Open-Source-Solution for IT-Monitoring developed around the proven Nagios-core.
Cacti – A complete network graphing solution designed to harness the power of RRDTool’s data storage and graphing functionality.

Narrowing Down Contenders

After reading through the above list and doing additional market research of your own, a few should jump out as potential products to pursue. If nothing on this page seems like what you’re looking for, you may not be looking for infrastructure monitoring software, but rather something more specific to your needs, such as application monitoring or database performance monitoring. Browse the vendors’ websites, focusing on the screenshots, and see whether you can evaluate the products well enough to narrow down which ones meet your requirements from Part 1.

The process should be iterative, moving back and forth between your requirements and the products that fit your budget. You may find that your requirements call for a solution that is priced out of your budget. If so, go back to your requirements and adjust them until you find something that meets your needs. Maybe you can go without monitoring system X, or maybe you can spend a little more time during setup to reduce the up-front investment.

At this point, depending on the scale of your project, you may choose to dive right into exploring each product live in your environment, or you may choose to contact the vendor to further narrow down your options. Before making a purchase I would highly recommend getting some hands-on time with the tool if possible.

Proof of Concept

You will not have the time to do a PoC (Proof of Concept) on every vendor you find, but if you narrow the field down to one to three solutions (preferably two) you can spend significant time getting to know each one. Your PoC should end up with a full installation of the tool in your environment, either in trial mode or with a temporary license. The key to the PoC process is NOT to monitor your entire environment, but to pick a few complex systems to fully monitor so you get a feel for the complexity. The challenge of these tools isn’t to “check if a server is up or down”, it’s to configure the tool in a way that gives YOU the most relevant data.

Here are a few tests I would suggest to compare the systems:

How hard was the installation process? Were there a lot of issues that may deter you from the solution?
How hard/easy is it to add a new system?
Is the interface fairly responsive and error-free?
If you have a set of servers which work together (such as an app and database pair) try to monitor them both, their OS metrics, and the services that run on them. Don’t go overboard but monitor items which may be “actionable”.
How does the dashboarding work? Can you quickly create a simple dashboard to show the status of your environment?
How are permissions handled? Do you have very granular permissions you can configure for team members or is it a simple Read-Only or Administrator?
How are alerts configured? Is it easy to change who gets alerted for a server, or when?

Selection

Once you’ve had hands-on time with the different solutions, hopefully you will have a good idea of which product is the best fit for your needs. If you’re still having trouble deciding between two close solutions you should consider putting your requirements into a weighted list. Assign an importance/weight to each requirement and then rate each solution from 0 to 10 on that requirement. Multiply the weight by the rating and add up the numbers for each solution. The product with the highest number gives you the best solution based on your requirements and their perceived importance. Don’t forget to factor in other deciding factors including cost, time to deploy, and ongoing maintenance. See this article for more information on creating a weighted matrix: Toolbox for IT: Constructing a Weighted Matrix
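The weighted list described above can be sketched in a few lines of Python. The requirement names, weights, product names, and ratings below are made-up placeholders; substitute your own requirements from Part 1 and the ratings from your PoC.

```python
def weighted_scores(weights, ratings):
    """Return {product: total}, where total = sum(weight * rating).

    `weights` maps requirement -> importance.
    `ratings` maps product -> {requirement -> rating from 0 to 10}.
    """
    totals = {}
    for product, product_ratings in ratings.items():
        total = 0
        for requirement, weight in weights.items():
            rating = product_ratings[requirement]
            if not 0 <= rating <= 10:
                raise ValueError(f"{product}/{requirement}: rating must be 0-10")
            total += weight * rating
        totals[product] = total
    return totals


# Hypothetical requirements and weights (higher = more important).
weights = {"alerting": 5, "dashboards": 3, "permissions": 2}

# Hypothetical PoC ratings for two finalists.
ratings = {
    "Product A": {"alerting": 8, "dashboards": 6, "permissions": 4},
    "Product B": {"alerting": 6, "dashboards": 9, "permissions": 7},
}

totals = weighted_scores(weights, ratings)
best = max(totals, key=totals.get)

for product, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{product}: {total}")
print(f"Highest weighted score: {best}")
```

With these placeholder numbers Product A scores 66 and Product B scores 71, so Product B comes out ahead even though Product A rates higher on the single most important requirement. Cost, time to deploy, and ongoing maintenance can either be added as weighted rows or considered separately afterward.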

Depending on the size of the vendor’s organization and its pricing flexibility, you may be able to negotiate large discounts off of list price. Without going into excessive detail, during the purchasing process you should work with someone from your organization’s procurement team to ensure you get the best possible cost for the product and fully understand the licensing model. If you don’t have access to a procurement resource, then treat this purchase like buying a new car. Each vendor wants your business, so don’t be afraid to hint at the fact that you’re considering competing products. It is usually worth at least testing the waters to see if they can drop their final price or throw in additional functionality to “sweeten the deal”.

Conclusion

At this point you should have a product selected and licenses in hand, and be preparing to deploy to your environment.

In this part we accomplished the following:

Took a brief tour of the major monitoring solutions (strengths/weaknesses)
Explored the iterative process of selecting the right fit for your requirements
Discussed the benefit of doing a “Proof of Concept”
Selected the correct solution for our environment

In Part 3 we’ll cover the following:

What makes a monitoring solution effective? How can it be the MOST effective?
What types of things we want to monitor and why
What do we want to be alerted on?

4 Comments

John April 25, 2016 at 11:10 AM

You missed Sensu on this list.
Axel Amigo May 10, 2016 at 8:16 AM

Pandora FMS… does it ring a bell? 🙂
Eric May 25, 2016 at 10:46 AM

How about EventSentry
http://www.eventsentry.com/about/livedemo
Daniel July 13, 2016 at 9:00 AM

The article is interesting as a starting point, but you don’t seem to be considering the complex hybrid IT. I know you’re talking about Infrastructure monitoring, but the truth is that what most companies need is to reduce their alert and monitoring SW’s and get a unified solution that fits 80% of their needs, and leaves them the time and effort to focus on the 20% customisation needed to achieve their business specific needs. It’s a very complex problem, i understand, and your posts help bring some light and your perspective, thanks for that. In my opinion, if you’re thinking of deploying something that helps monitoring and alerting for the whole company, from the network switch to the payments app on the webshop and provide a root cause analysis to really help pinpoint and speed up problem solving, then my choices for a PoC right now are leaning towards: ScienceLogic, Zenoss and DataDog (linked with NewRelic that we already use for sw monitoring). Do you have an opinion about this?

About the Author: Caesar Kabalan
