Hey Kishore - Thanks for the info. I found an interesting library called jmxetric (http://code.google.com/p/jmxetric) that reads MBeans and publishes their contents to Ganglia, and it's working pretty well. A simplified config looks like:
<jmxetric-config>
  <jvm process="Zookeeper"/>
  <sample delay="60">
    <mbean name="org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Leader" pname="ZK">
      <attribute name="AvgRequestLatency" type="double"/>
      <attribute name="MaxRequestLatency" type="double"/>
      <attribute name="MinRequestLatency" type="double"/>
      <attribute name="OutstandingRequests" type="double"/>
      <attribute name="PacketsReceived" type="double"/>
      <attribute name="PacketsSent" type="double"/>
    </mbean>
  </sample>
</jmxetric-config>

It doesn't solve the nested property issue, unfortunately, so I may have to flatten some statistics as you have. I'm interested in checking out your code if you don't mind.

At a higher level, I'm interested in setting up the sort of monitoring one would expect of a critical datacenter service. To start with, I'd like to collect the data necessary to:

- page when there's no leader
- page when only the minimum number of replicas needed to reach quorum are present
- email when replicas are missing, but the ensemble is still above the quorum minimum

For example, send an email when 1/5 are down, and page when 2/5 are down. Also page if there's no leader for some other reason.

The operational metrics like latencies, connections, and requests would be useful in troubleshooting issues as well as capacity planning.

--travis

On Wed, Apr 14, 2010 at 4:50 PM, kishore g <g.kish...@gmail.com> wrote:

> Hi Travis,
>
> We do monitor zookeeper using JMX. We have some simple code which does the
> following:
>
> - Parse the JMX output and convert it into key-value format. The nested
>   properties are flattened.
> - Emit the key-value pairs using the LWES [http://www.lwes.org/] APIs at a
>   regular, configurable interval.
> - The keys to be emitted can be configured via a config file.
>
> We have our own internal reporting framework which displays these metrics.
> In order to differentiate between leader and follower we use separate keys:
>
> ReplicatedServer_idXXX_replica.XXX_Follower.AvgRequestLatency=rsf_mrl
> ReplicatedServer_idXXX_replica.XXX_Leader.AvgRequestLatency=rsl_mrl
>
> If the server is the leader then rsf_mrl will be empty, and vice versa. I can
> provide the code to do this and you can probably change it to meet your
> needs and enhance it to work for Ganglia. Let me know if this helps you.
>
> thanks,
> Kishore G
>
> On Wed, Apr 14, 2010 at 11:12 AM, Travis Crawford
> <traviscrawf...@gmail.com> wrote:
>
> > Hey zookeeper gurus -
> >
> > Are there any recommended ways for one to monitor zookeeper ensembles? I'm
> > familiar with the four-letter words and that stats are published via JMX -
> > I'm more interested in what people are doing with those stats.
> >
> > I'd like to publish the JMX stats to Ganglia, and this works well for the
> > built-in stats. However, the zookeeper-specific names appear to be dynamic,
> > which causes issues when deciding what to publish. For example, the current
> > mode (leader/follower) appears to only be accessible from the bean names,
> > instead of looking at, say, a "mode" stat.
> >
> > org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Follower
> > org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader
> >
> > The only way I've found to learn if replicas are up-to-date is looking at
> > "synced" buried in followerInfo:
> >
> > $ java -jar cmdline-jmxclient-0.10.5.jar - localhost:8081 \
> >     org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader \
> >     followerInfo
> > 04/14/2010 18:06:06 +0000 org.archive.jmx.Client followerInfo:
> > FollowerHandler Socket[addr=/10.0.0.10,port=48104,localport=2888] tickOfLastAck:29793 synced?:true queuedPacketLength:0
> > FollowerHandler Socket[addr=/10.0.0.11,port=59599,localport=2888] tickOfLastAck:29793 synced?:true queuedPacketLength:0
> >
> > I don't mind writing a tool to parse the JMX output and publishing to
> > Ganglia if needed, but it seems like a problem that may have already been
> > solved and I'm curious what others are doing. The tool would basically take
> > the zookeeper stats, normalize the names, and publish to a timeseries
> > database.
> >
> > Is anyone already monitoring ZK in a way others might find useful?
> >
> > Thanks!
> > Travis
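
The alerting rules and JMX strings discussed in this thread could be combined roughly as follows. This is only a minimal Python sketch under stated assumptions: the function names (parse_mode, count_synced_followers, alert_level) are hypothetical, the regular expressions are based solely on the sample bean names and followerInfo output quoted above, and the thresholds encode one reading of "email when 1/5 are down, page when 2/5 are down". It does not fetch anything over JMX; it only interprets strings that some other tool has already collected.

```python
import re

# Bean name layout as it appears in the thread, e.g.
# org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader
# (assumption: other ZooKeeper versions may name beans differently).
BEAN_RE = re.compile(
    r"name0=ReplicatedServer_id(?P<id>\d+),name1=replica\.\d+,name2=(?P<mode>\w+)"
)


def parse_mode(bean_name):
    """Return (server_id, mode) parsed from a bean name, or None if it does not match."""
    match = BEAN_RE.search(bean_name)
    if match is None:
        return None
    return int(match.group("id")), match.group("mode").lower()


def count_synced_followers(follower_info):
    """Count 'synced?:true' entries in the leader's followerInfo text."""
    return len(re.findall(r"synced\?:true", follower_info))


def alert_level(ensemble_size, replicas_up, has_leader):
    """Hypothetical policy returning 'page', 'email', or 'ok'."""
    quorum = ensemble_size // 2 + 1  # smallest majority of the ensemble
    if not has_leader or replicas_up <= quorum:
        # No leader, or no spare replica left beyond the quorum minimum.
        return "page"
    if replicas_up < ensemble_size:
        # Some replicas are missing, but there is still headroom above quorum.
        return "email"
    return "ok"
```

For a five-node ensemble this pages when three or fewer replicas are up or when no leader is reported, and emails when exactly four are up. The replica count would have to come from polling every server, since followerInfo is only visible on the leader's bean.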