Hello Folks,

Could you please share your experiences on this?
Thanks,
Prabhjot

On Nov 23, 2015 3:53 PM, "Prabhjot Bharaj" <[email protected]> wrote:

> Hello Folks,
>
> I would like to know which ZooKeeper parameters are important to
> monitor on a ZooKeeper server via its JMX port. I've set up my 5-node
> ZooKeeper ensemble following the steps on this page:
> https://zookeeper.apache.org/doc/r3.4.6/zookeeperJMX.html#ch_console
>
> After connecting to the JVM via jconsole, I can see the stats. But I
> would like to know which stats/values we should send to our reporting
> system so that we are alerted when a vital parameter shows an
> unexpected value.
>
> --------------------------------------------
>
> Here is the homework I've done on it:
>
> 1. QuorumSize (under ReplicatedServer_id<#myid value>) - Must always
>    be equal to the number of nodes in zookeeper.conf.
>
>    a. Example MBean -
>       org.apache.ZooKeeperService:name0=ReplicatedServer_id7
>
>    b. Alert - It should never be lower than (floor(n/2) + 1), where n
>       is the total number of machines participating in the ensemble.
>       If it drops below that, the cluster's health is bad.
>
>    c. Procedure - Bounce the servers that are not participating in
>       the quorum and see whether this attribute changes.
>
> 2. NodeCount (under InMemoryDataTree) - Should be equal on all the
>    machines in a cluster. This helps us check the consistency of the
>    nodes in the ZooKeeper cluster.
>
>    a. Example MBean -
>       org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader,name3=InMemoryDataTree
>
>    b. Alert - If any server in the cluster reports a NodeCount that
>       differs from the other servers, fire an alert.
>
>    c. Procedure - There is no generalised solution for this. It will
>       need investigation.
>
> 3. Memory Management
>
>    a. Garbage collection - The important parameters for monitoring
>       garbage collection on the ZooKeeper server nodes are listed
>       below. If any value in this section is significantly higher on
>       one node than on the other nodes in the ensemble, it can point
>       to a problem in the cluster.
>
>       i. ConcurrentMarkSweep time, to be monitored across all nodes.
>          Example MBean -
>          java.lang:type=GarbageCollector,name=ConcurrentMarkSweep
>
>       ii. ParNew time, to be monitored across all nodes.
>           Example MBean -
>           java.lang:type=GarbageCollector,name=ParNew
>
> 4. Leader count - Must be 1 at all times. Out of all the
>    replica.<#myid value> entries under ReplicatedServer_id<#myid value>
>    on all machines, there should be only one leader.
>
>    a. Example MBean -
>       org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader
>
>    b. Alert - Among all the nodes reporting data in the cluster,
>       name<x>=Leader should appear exactly once; set up an alert on
>       this. If more than one leader is reported, it means ZooKeeper
>       went through a split brain. This is high risk.
>
>    c. Procedure - Check that the network is healthy among the
>       machines. If there is network slowness among nodes within a
>       rack, or across racks (in case the ZooKeeper nodes are placed
>       across racks), it must be fixed. Until it is solved, find a
>       machine with good network connectivity, push a config that adds
>       this new machine to the cluster, and remove the affected machine
>       from the cluster.
>
> --------------------------------------------
>
> I would like to know whether the above parameters are sufficient for
> monitoring the cluster, or whether I missed something. Please point me
> in the right direction, and feel free to point out any changes needed
> in the above write-up.
>
> Thanks,
> Prabhjot
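To make the alert rules from the write-up concrete, here is a rough sketch in Python. It assumes the attribute values have already been pulled from each server's JMX port by some collector. The `ServerSample` type, its field names, and the function names are hypothetical and not part of ZooKeeper; only the rules themselves (majority of floor(n/2) + 1, equal NodeCount, exactly one leader) come from the write-up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServerSample:
    """One poll of a single server (hypothetical container, not a ZooKeeper type)."""
    quorum_size: int  # QuorumSize attribute under ReplicatedServer_id<myid>
    node_count: int   # NodeCount attribute under InMemoryDataTree
    is_leader: bool   # True if this server exposes a name2=Leader MBean


def quorum_threshold(n: int) -> int:
    """Smallest majority of an n-server ensemble: floor(n/2) + 1."""
    return n // 2 + 1


def check_ensemble(samples: Dict[str, ServerSample], ensemble_size: int) -> List[str]:
    """Return one alert message per rule that is violated."""
    alerts = []
    threshold = quorum_threshold(ensemble_size)

    # Rule 1: QuorumSize must not drop below the majority threshold.
    for server, sample in sorted(samples.items()):
        if sample.quorum_size < threshold:
            alerts.append(f"{server}: QuorumSize {sample.quorum_size} < {threshold}")

    # Rule 2: every server should report the same NodeCount.
    if len({sample.node_count for sample in samples.values()}) > 1:
        alerts.append("NodeCount differs across servers")

    # Rule 4: exactly one server should report itself as leader.
    leaders = sum(sample.is_leader for sample in samples.values())
    if leaders != 1:
        alerts.append(f"expected 1 leader, found {leaders}")

    return alerts
```

For a 5-node ensemble the threshold works out to 3. NodeCount may differ briefly while a follower catches up, so it is probably safer to fire that alert only when the mismatch persists over several polls.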

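For the garbage collection point ("significantly higher than that of other nodes"), one possible way to turn it into a check is to compare each server against the median of the others. This is only a sketch: the function name `gc_outliers`, the `factor` of 3, and the sample numbers are assumptions of mine, not anything from ZooKeeper. The GarbageCollector MBeans report CollectionTime as a cumulative value in milliseconds, so the inputs are assumed to be deltas taken over the same polling interval on every server.

```python
from statistics import median
from typing import Dict, List


def gc_outliers(collection_time_ms: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Return servers whose GC time exceeds `factor` times the median of the other servers.

    `collection_time_ms` maps a server name to its GC time delta for one
    collector (e.g. ConcurrentMarkSweep or ParNew) over the polling interval.
    """
    outliers = []
    for server, value in sorted(collection_time_ms.items()):
        others = [v for name, v in collection_time_ms.items() if name != server]
        if not others:
            continue  # a single server has nothing to be compared against
        baseline = median(others)
        if baseline > 0 and value > factor * baseline:
            outliers.append(server)
    return outliers


# Hypothetical per-interval ParNew times (ms) for a 5-node ensemble.
sample = {"zk1": 120, "zk2": 110, "zk3": 900, "zk4": 130, "zk5": 125}
print(gc_outliers(sample))
```

The median is used rather than the mean so that one misbehaving server does not pull the baseline up and hide itself.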