Leader Election with Zookeeper

Recently I had to implement active-passive redundancy for a singleton service in our production environment, where the general rule is to always have “more than one of anything”. The main motivation is to alleviate the need to manually monitor and manage these services, whose presence is crucial to the overall health of the site.

This means that we sometimes have a service installed on several machines for redundancy, but only one of them is active at any given moment. If the active service goes down for some reason, another instance rises to do its work. Turns out this is actually called leader election. One of the most prominent open source implementations facilitating the process of leader election is Zookeeper.

Originally developed by Yahoo Research, Zookeeper is a service providing reliable distributed coordination. It is highly concurrent, very fast and suitable mainly for read-heavy access patterns. Reads can be served by any node of a Zookeeper cluster, while writes are quorum-based. To reach a quorum, Zookeeper utilizes an atomic broadcast protocol.
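To get a feel for what quorum-based writes imply for cluster sizing, here is a small sketch that computes the majority needed for a cluster of a given size and how many server failures it can ride out. It assumes a plain majority quorum; the class and method names are made up for this example and are not part of the Zookeeper API.

```java
// Hypothetical helper for illustration only, not part of the Zookeeper API.
final class QuorumSketch {
    // Smallest number of servers that forms a strict majority of the cluster.
    static int quorumSize(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Number of servers that can fail while a majority is still reachable.
    static int toleratedFailures(int clusterSize) {
        return clusterSize - quorumSize(clusterSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.printf("servers=%d quorum=%d tolerated failures=%d%n",
                    n, quorumSize(n), toleratedFailures(n));
        }
    }
}
```

Under this assumption a cluster of 4 tolerates no more failures than a cluster of 3, which is one reason odd-sized clusters are the usual choice.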

So how does Zookeeper work?

Connectivity, State and Sessions

Zookeeper maintains an active connection with all its clients using a heartbeat mechanism. Furthermore, Zookeeper maintains a session for each active client that is connected to it. When a client is disconnected from Zookeeper for more than a specified timeout the session expires. This means that Zookeeper keeps a pretty good picture of all the animals in its zoo. The key thing here is that Zookeeper knows about the state of its clients and is able to notify others in a consistent, ordered manner about this state and its changes.
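The heartbeat and timeout idea can be pictured with a tiny in-memory model. This is only a sketch of the concept, not how the Zookeeper server is implemented; `SessionTrackerSketch` and its methods are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of session tracking; the real server logic is more involved.
final class SessionTrackerSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    SessionTrackerSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a heartbeat; this also creates the session entry if none exists.
    void heartbeat(String sessionId, long nowMillis) {
        lastHeartbeat.put(sessionId, nowMillis);
    }

    // A session is alive while its last heartbeat is within the timeout.
    boolean isAlive(String sessionId, long nowMillis) {
        Long last = lastHeartbeat.get(sessionId);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    // Drop sessions that have been silent for longer than the timeout.
    void expireStale(long nowMillis) {
        lastHeartbeat.values().removeIf(last -> nowMillis - last > timeoutMillis);
    }

    public static void main(String[] args) {
        SessionTrackerSketch tracker = new SessionTrackerSketch(10_000);
        tracker.heartbeat("client-1", 0);
        System.out.println(tracker.isAlive("client-1", 5_000));
        tracker.expireStale(10_001);
        System.out.println(tracker.isAlive("client-1", 10_001));
    }
}
```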

Data Model

The Zookeeper data model consists of a hierarchy of nodes, called ZNodes. ZNodes can hold a relatively small (efficiency is key here) amount of data, and they are versioned and timestamped. There are several properties a ZNode can have that make it particularly useful for different use cases. Each node in Zookeeper can carry the persistent, ephemeral and sequential flags. These determine the naming of the node and its behavior with respect to the client session.

  • The persistent node is basically a managed data bin
  • The ephemeral node exists for the lifetime of its client session
  • The sequential node, when created, gets a unique number (identity, sequence, counter and like a hundred different names you could call this) suffixed to its name

The latter two provide the means to implement a variety of distributed coordination tasks such as locks, queues, barriers, transactions, elections and other synchronization related tasks.
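To make the flags concrete, here is a minimal in-memory sketch of a single parent node. It assumes the 10-digit, zero-padded suffix that Zookeeper appends to sequential node names; the counter handling is simplified, and `ZNodeFlagsSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Models the children of one parent node; not a real Zookeeper client.
final class ZNodeFlagsSketch {
    private int nextSequence = 0; // simplified per-parent counter
    private final SortedSet<String> children = new TreeSet<>();
    private final Map<String, String> ephemeralOwner = new HashMap<>();

    // ownerSession == null means persistent, otherwise the node is ephemeral.
    String create(String name, boolean sequential, String ownerSession) {
        String actual = sequential
                ? String.format("%s%010d", name, nextSequence++)
                : name;
        children.add(actual);
        if (ownerSession != null) {
            ephemeralOwner.put(actual, ownerSession);
        }
        return actual;
    }

    // Ephemeral nodes disappear together with the session that created them.
    void sessionClosed(String sessionId) {
        Iterator<Map.Entry<String, String>> it = ephemeralOwner.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (entry.getValue().equals(sessionId)) {
                children.remove(entry.getKey());
                it.remove();
            }
        }
    }

    SortedSet<String> children() {
        return Collections.unmodifiableSortedSet(children);
    }

    public static void main(String[] args) {
        ZNodeFlagsSketch parent = new ZNodeFlagsSketch();
        parent.create("config", false, null);       // persistent
        parent.create("n_", true, "session-1");     // ephemeral-sequential
        parent.create("n_", true, "session-2");     // ephemeral-sequential
        parent.sessionClosed("session-1");
        System.out.println(parent.children());
    }
}
```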

Events

Zookeeper allows its clients to watch for different events in its node hierarchy. Finally, a database with callbacks! This way clients can get notified of different changes in the distributed state of affairs and act accordingly. These watches are one-time triggers and must be registered again by the client after each notification. The client is also responsible for handling session expiration, which means that ephemeral nodes have to be re-created after an expiration.
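The one-time nature of watches is easy to trip over, so here is a small in-memory sketch of the behavior. It is not the Zookeeper client API; `OneShotWatchSketch`, `watch` and `nodeChanged` are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Models one-time watches: a watcher fires once and is then forgotten.
final class OneShotWatchSketch {
    private final Map<String, List<Consumer<String>>> watchers = new HashMap<>();

    void watch(String path, Consumer<String> watcher) {
        watchers.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
    }

    // Watchers are removed before being notified, so a callback may safely
    // register itself again for the next change.
    void nodeChanged(String path) {
        List<Consumer<String>> fired = watchers.remove(path);
        if (fired != null) {
            fired.forEach(w -> w.accept(path));
        }
    }

    public static void main(String[] args) {
        OneShotWatchSketch zk = new OneShotWatchSketch();
        zk.watch("/election/n_0000000001", p -> System.out.println("changed: " + p));
        zk.nodeChanged("/election/n_0000000001"); // watcher fires
        zk.nodeChanged("/election/n_0000000001"); // no watcher left, nothing happens
    }
}
```

A client that wants continuous notifications has to call `watch` again from inside its callback, which is the chore that persistent listeners take off your hands.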

Phew, am I babbling today or what? Down to business.

Zookeeper requires a lot of boilerplate code, mostly around connectivity, and for the majority of the time you will be doing the same things over and over. Luckily Stefan Groschupf and Patrick Hunt wrote a client abstraction called ZkClient. I published a maven artifact for this on OSS so it’s available to our build system. The library also provides a persistent notification scheme in the form of listeners.

Next I cooked up a factory bean for ZkClient and a template style class to act as an applicative abstraction to Zookeeper operations:

    <bean id="zkClient" class="org.projectx.zookeeper.ZkClientFactoryBean">
        <property name="ensemble" value="localhost:2181,localhost:2182,localhost:2183" />
        <property name="connectionTimeout" value="2000" />
        <property name="sessionTimeout" value="10000" />
        <property name="stateListeners">
            <list>
                <ref local="zkStatsCollector" />
            </list>
        </property>
    </bean>
    
    <bean id="zkStatsCollector" class="org.projectx.zookeeper.ZookeeperClientStatsCollector" />

    <bean id="zkTemplate" class="org.projectx.zookeeper.ZookeeperTemplate">
        <constructor-arg ref="zkClient" />
    </bean>

The ZookeeperClientStatsCollector is a listener implementation which collects stats about session connects/disconnects, exported to JMX as an MBean.

Now I can happily create nodes to my heart’s content.

Leader Election

The Zookeeper documentation describes in general terms how leader election is to be performed. The general idea is that all participants of the election process create an ephemeral-sequential node on the same election path. The node with the smallest sequence number is the leader. Each “follower” node watches the node with the next lower sequence number, which prevents a herd effect when the leader goes away. In effect this creates a linked list of nodes. When a node’s local leader dies, the node goes to election: it either finds a smaller node to follow or becomes the leader if it now has the lowest sequence number.

The following image describes a scenario with 3 clients participating in the election process:

[Image: Leader Election with Zookeeper]

Each client participating in this process has to:

  1. Create an ephemeral-sequential node to participate under the election path
  2. Find its leader and follow (watch) it
  3. Upon leader removal go to election and find a new leader, or become the leader if no leader is to be found
  4. Upon session expiration check the election state and go to election if needed
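Steps 2 and 3 boil down to one decision over the children of the election path. The sketch below shows that decision in isolation, with no Zookeeper client involved. It assumes all election nodes share the same name prefix, so that the zero-padded sequence suffix sorts lexicographically; `ElectionSketch` and `nodeToWatch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Pure decision logic: who do I follow, or am I the leader?
final class ElectionSketch {
    // Returns the node to watch, or empty if ownNode is the leader.
    static Optional<String> nodeToWatch(List<String> children, String ownNode) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted); // relies on the shared prefix and zero padding
        int index = sorted.indexOf(ownNode);
        if (index < 0) {
            // Typically means the session expired and the node must be re-created.
            throw new IllegalStateException("own node is missing: " + ownNode);
        }
        return index == 0 ? Optional.empty() : Optional.of(sorted.get(index - 1));
    }

    public static void main(String[] args) {
        List<String> children =
                Arrays.asList("n_0000000002", "n_0000000000", "n_0000000001");
        for (String node : children) {
            String role = nodeToWatch(children, node)
                    .map(leader -> "follows " + leader)
                    .orElse("is the leader");
            System.out.println(node + " " + role);
        }
    }
}
```

Running the same function again after a watched node disappears covers the re-election case: the caller either gets a new node to watch or finds out it has become the leader.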

One thing to consider here is the nature of the work being done by the leader. Make sure its state can be preserved if its leadership is revoked. Leader loss could be caused by any number of reasons, including initiated restarts due to maintenance and releases. It could also be brought about by network partitioning.

Designing services for graceful recovery is a requirement of distributed systems in general, not just of leader election.

Spring helps here because interception can be used to suppress method invocations of various services based on leadership status. Below is an example of an interception based leadership control:

    <bean id="leaderElectionProxyTemplate" class="org.springframework.aop.framework.ProxyFactoryBean"
        abstract="true">
        <property name="interceptorNames">
            <list>
                <value>leaderElectionTarget</value>
            </list>
        </property>
    </bean>

    <bean id="leaderElectionTarget"  class="org.projectx.zookeeper.election.LeaderElectionTargetInterceptor" />

    <bean id="myService" parent="leaderElectionProxyTemplate">
        <property name="target">
            <bean class="org.projectx.service.MyService" />
        </property>
    </bean>

The service myService is the one controlled by leader election; all its methods are going to be suppressed or invoked based on leadership status.
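The interception idea can be tried out without Spring. The sketch below uses a plain JDK dynamic proxy as a stand-in for the Spring AOP interceptor, so it is not the implementation behind the beans above; `LeaderGateSketch` and `guard` are hypothetical names, and the leadership flag is a simple `AtomicBoolean` that an election listener would flip.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicBoolean;

// Suppresses calls to the target while this instance is not the leader.
final class LeaderGateSketch {
    static <T> T guard(Class<T> type, T target, AtomicBoolean isLeader) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (!isLeader.get()) {
                // Suppressed call. Returning null is only safe for void or
                // object-returning methods, not for primitive return types.
                return null;
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow what the target actually threw
            }
        };
        Object proxy = Proxy.newProxyInstance(
                target.getClass().getClassLoader(), new Class<?>[] {type}, handler);
        return type.cast(proxy);
    }

    public static void main(String[] args) {
        AtomicBoolean isLeader = new AtomicBoolean(false);
        // Runnable stands in for a real service interface here.
        Runnable service = guard(Runnable.class,
                () -> System.out.println("doing the leader's work"), isLeader);
        service.run();      // suppressed, not the leader
        isLeader.set(true); // leadership granted
        service.run();      // invoked
    }
}
```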

Another implementation uses a Quartz scheduler instance as its target:

    <bean id="leaderElectionTarget"
        class="org.projectx.zookeeper.election.quartz.SchedulerElectionTarget">
        <constructor-arg ref="myScheduler" />
    </bean>

This implementation puts a Quartz scheduler in standby mode when leadership is revoked and resumes it when leadership is granted. Notice that standby will not actually stop running tasks; these will be allowed their natural completion, so in effect you may have a scheduled task running on two services at once in partitioning scenarios. This means that whatever is scheduled has to be aware of another service possibly doing the same work. This problem can be solved with a Zookeeper barrier implementation, more on that in another post.

But Zookeeper’s capabilities extend far beyond Leader Election.

Presence

Another very good use for Zookeeper is maintaining a clear image of who’s alive in the data center. Every service creates an ephemeral node on some presence path. This can then be viewed by the human eye in the Zookeeper Hue Browser, the Rest Server or the Solr Cloud console. You could also use the various client APIs to point your monitoring tools at Zookeeper to know who’s where.

Here’s an example of defining a presence manager with Spring:

    <bean id="presenceNodeFactory" class="org.projectx.zookeeper.presence.PresenceNodeFactory">
        <constructor-arg ref="zkTemplate" />
        <constructor-arg value="${org.projectx.zookeeper.services.root}/${org.projectx.environment}" />
        <constructor-arg value="${zookeeper.entity.name}" />
        <constructor-arg ref="serviceMetaDataProvider" />
    </bean>

    <bean id="presenceManager" class="org.projectx.zookeeper.presence.PresenceManager">
        <constructor-arg ref="presenceNodeFactory" />
    </bean>

    <bean id="serviceMetaDataProvider"
        class="org.projectx.zookeeper.presence.ServiceMetaDataProvider">
        <constructor-arg value="${zookeeper.entity.name}" />
    </bean>

The serviceMetaDataProvider decorates the node with whatever data you’d like to see on the presence node (ip, revision, release numbers, various timestamps and what not).

Configuration

Some applications use Zookeeper to store configuration (provided it’s within the size limits recommended by Zookeeper). This is much better than managing your configuration in property files, XML files or even a database (remember, Zookeeper is fast for reads). Zookeeper also offers access control with ACLs on its node structure, and this comes in handy when managing configurations. Solr Cloud and Glu use Zookeeper in this way.

If you wish to run your data center the democratic way, where important decisions are made in coordination with other stakeholders, Zookeeper certainly helps.

Leader Election with Spring is on GitHub. The source shown in this post can be found on Gist.

Happy Zookeeping!