Monitoring Apache Hadoop, Cassandra and Zookeeper using Graphite and JMXTrans

I love monitoring. Well, actually I don’t, but I realized there’s no sane way to live without it, and so I’ve grown to love it.

One of the keys to successfully running big data infrastructure like Apache Hadoop, Apache Cassandra or Apache Zookeeper in production is monitoring the heck out of it. This matters for several reasons. First and foremost is learning: looking at monitoring charts can teach you a lot about the internal workings and behavior of these systems. Second is configuration tuning. For example, when you change a Column Family’s caching configuration, you need insight into how the change affects performance and the overall behavior of the cluster. Third is problem identification: you want to measure the behavior of your infrastructure over time to identify performance degradations, bottlenecks, etc. Fourth is capacity planning, and the ability to measure performance as the environment around your infrastructure changes (e.g., traffic growth).

So monitoring is important, very important.

This post shows how I achieved some decent monitoring for these systems using Graphite, JMXTrans, some Python scripts and Graphitus.

I won’t go into the details of installing Cassandra, Zookeeper, JMXTrans or Graphite. All of those are well documented on their respective sites and across the web. Assuming you have them all up and running, I will show you how to integrate them.

The first thing to check is whether these Java-based systems have their JMX ports open:

(erez@cass1.prod:~)$ ps -ef | grep java
/usr/java/default//bin/java -ea ... -Dcom.sun.management.jmxremote.port=9009 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false ... org.apache.cassandra.thrift.CassandraDaemon

So in this case we see that the port is 9009, SSL is disabled and no authentication is used. If you have enabled authentication, JMXTrans supports that (take a look at the examples section). Do the same for the Zookeeper and the Hadoop NameNode, JobTracker, TaskTracker and DataNode Java processes to learn their JMX ports and configuration.
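If you have many hosts to check, reading the flags by eye gets tedious. The following is a minimal sketch, not part of the scripts mentioned in this post, that pulls the `com.sun.management.jmxremote.*` settings out of a Java command line such as the `ps` output above. The function name `jmx_settings` is made up for illustration.

```python
import re

# Matches JVM flags of the form -Dcom.sun.management.jmxremote.<key>=<value>
JMX_FLAG = re.compile(r"-Dcom\.sun\.management\.jmxremote\.([\w.]+)=(\S+)")


def jmx_settings(command_line):
    """Return the JMX remote settings found in one java command line."""
    return dict(JMX_FLAG.findall(command_line))


if __name__ == "__main__":
    # Sample command line, shortened from the ps output shown above.
    sample = (
        "/usr/java/default//bin/java -ea "
        "-Dcom.sun.management.jmxremote.port=9009 "
        "-Dcom.sun.management.jmxremote.ssl=false "
        "-Dcom.sun.management.jmxremote.authenticate=false "
        "org.apache.cassandra.thrift.CassandraDaemon"
    )
    print(jmx_settings(sample))
```

Feeding it each line of `ps -ef | grep java` gives you a per-process dictionary with the port, SSL and authentication settings, which is all JMXTrans needs to connect.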

Next we need to generate the JSON files for JMXTrans. For this I wrote a set of Python scripts, which you can find on GitHub. Go into the relevant directory, edit the genjmxtrans.py script, and set your Graphite URL and the set of hosts your infrastructure is running on. For Hadoop and Zookeeper you’re all set and can run the generator script. For Cassandra there’s one extra step: since some MBeans are per Column Family, you need to provide the script with the Keyspace and Column Family configuration. You can provide a static file (see the example file) or use a remote URL from a service like Jocassta, which reads the Keyspace/CF configuration from Cassandra and exposes it as JSON:

[
    {
        "name": "my_keyspace",
        "columnFamilies": [
            {
                "name": "MyColumnFamily1"
            },
            {
                "name": "MyColumnFamily2"
            }
        ]
    }
]

When you’re done editing the script, run it and it will generate a bunch of JMXTrans JSON files. Throw these into the JMXTrans configuration folder and JMXTrans will start periodically collecting metrics from these systems into your Graphite server.
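To give a rough idea of what such a generator does, here is a simplified sketch. It is not the actual genjmxtrans.py. It assumes the Keyspace/CF JSON layout shown above and the usual JMXTrans layout of `servers`, `queries` and `outputWriters`. The MBean name pattern and the attribute names (`ReadCount`, `WriteCount`) are assumptions that depend on your Cassandra version, so check them with a JMX browser such as jconsole. The host names and the `resultAlias` scheme are hypothetical.

```python
import json

# Output writer class used by JMXTrans to send metrics to Graphite.
GRAPHITE_WRITER = "com.googlecode.jmxtrans.model.output.GraphiteWriter"


def column_family_queries(keyspaces, graphite_host, graphite_port=2003):
    """Build one JMXTrans query per Column Family (MBean names are assumed)."""
    writer = {
        "@class": GRAPHITE_WRITER,
        "settings": {"host": graphite_host, "port": graphite_port},
    }
    queries = []
    for keyspace in keyspaces:
        for cf in keyspace["columnFamilies"]:
            ks_name, cf_name = keyspace["name"], cf["name"]
            queries.append({
                # Assumed MBean naming; verify against your Cassandra version.
                "obj": "org.apache.cassandra.db:type=ColumnFamilies,"
                       f"keyspace={ks_name},columnfamily={cf_name}",
                "attr": ["ReadCount", "WriteCount"],  # assumed attribute names
                "resultAlias": f"cassandra.cf.{ks_name}.{cf_name}",  # hypothetical
                "outputWriters": [writer],
            })
    return queries


def server_config(host, jmx_port, queries):
    """Wrap the queries in the top-level document for a single server."""
    return {"servers": [{"host": host, "port": str(jmx_port), "queries": queries}]}


if __name__ == "__main__":
    keyspaces = [{"name": "my_keyspace",
                  "columnFamilies": [{"name": "MyColumnFamily1"},
                                     {"name": "MyColumnFamily2"}]}]
    queries = column_family_queries(keyspaces, "graphite.example.com")
    # cass1.prod and port 9009 come from the ps example; adjust for your hosts.
    print(json.dumps(server_config("cass1.prod", 9009, queries), indent=2))
```

Writing one such document per host into the JMXTrans configuration folder is essentially the shape of output the generator produces, just with many more queries.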

If you’re not up to running the script, take a look at the sample output files for Cassandra, Hadoop and Zookeeper and edit them by hand to include your Graphite server URL and your own server hosts.

You could stop right here if you want: the metrics are in the system and you can start monitoring everything that’s being collected. If you want to take the extra step of building monitoring dashboards for them, read on.
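Before building anything on top, it can be worth confirming that data is actually arriving. A small sketch using Graphite’s render API with `format=json` is shown here. The Graphite URL and the metric path are placeholders, since the real paths depend on how your JMXTrans files name things.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(graphite_url, target, since="-1h"):
    """Build a Graphite render API URL that returns JSON."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{graphite_url.rstrip('/')}/render?{query}"


def latest_values(series_list):
    """Map each series name to its most recent non-null value (or None)."""
    latest = {}
    for series in series_list:
        # Graphite returns datapoints as [value, timestamp] pairs.
        values = [value for value, _ts in series["datapoints"] if value is not None]
        latest[series["target"]] = values[-1] if values else None
    return latest


def fetch_latest(graphite_url, target):
    """Query Graphite over HTTP and return the latest value per series."""
    with urlopen(render_url(graphite_url, target)) as response:
        return latest_values(json.load(response))


if __name__ == "__main__":
    # Placeholder URL and metric path; substitute your own.
    print(render_url("http://graphite.example.com", "servers.cass1.cassandra.*"))
```

A series whose latest value is `None` for the whole window usually means the query in the JMXTrans file did not match any MBean or attribute.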

My next step was to define Graphitus dashboards for these. Here are some example configurations:

Cassandra Servers Dashboard: https://gist.github.com/erezmazor/5019989#file-cassandra-servers
Cassandra Server Internals Dashboard: https://gist.github.com/erezmazor/5019989#file-cassandra-servers-internals
Cassandra Per-Column Family Dashboard: https://gist.github.com/erezmazor/5019989#file-cassandra-column-families

Hadoop NameNode Dashboard: https://gist.github.com/erezmazor/5020008#file-hadoop-namenode
Hadoop JobTracker Dashboard: https://gist.github.com/erezmazor/5020008#file-hadoop-jobtracker
Hadoop TaskTracker Dashboard: https://gist.github.com/erezmazor/5020008#file-hadoop-tasktracker
Hadoop DataNode Dashboard: https://gist.github.com/erezmazor/5020008#file-hadoop-datanode

Zookeeper Dashboard: https://gist.github.com/erezmazor/5020016#file-zookeeper-servers

These generate dashboards that I look at almost daily. Here are some screenshots:

Cassandra Servers Dashboard:

[Screenshot: Cassandra servers dashboard]

Zookeeper Servers Dashboard:

[Screenshot: Zookeeper servers dashboard]

Hadoop JobTracker Dashboard:

[Screenshot: Hadoop JobTracker dashboard]

Last but not least is the alerting component. Given that we have these metrics in Graphite, it is now easy to define alerts on them using a system like Seyren:

[Screenshot: Seyren checks for Cassandra]
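The idea behind threshold checks of this kind is simple enough to sketch. The following is an illustration of the concept only, not Seyren’s implementation. It assumes a metric where higher is worse and a warn level at or below the error level. The metric names and thresholds are invented.

```python
def check_state(value, warn, error):
    """Classify a metric value against warn/error thresholds (higher is worse)."""
    if value is None:
        return "UNKNOWN"  # no recent datapoint for this series
    if value >= error:
        return "ERROR"
    if value >= warn:
        return "WARN"
    return "OK"


def evaluate(latest_by_target, warn, error):
    """Apply the same thresholds to every series of a Graphite target."""
    return {target: check_state(value, warn, error)
            for target, value in latest_by_target.items()}


if __name__ == "__main__":
    # Hypothetical pending-task counts per Cassandra node.
    latest = {"cass1.pending": 3, "cass2.pending": 42, "cass3.pending": None}
    print(evaluate(latest, warn=10, error=50))
```

An alerting system adds the parts that matter operationally on top of this: scheduling, state changes over time, and notifications.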