Hi Srikanth. Do you see any of these in your server logs?
LOG.warn("fsync-ing the write ahead log in "
        + Thread.currentThread().getName()
        + " took " + syncElapsedMS
        + "ms which will adversely effect operation latency. "
        + "See the ZooKeeper troubleshooting guide");
Patrick
On Mon, Oct 7, 2013 at 11:45 PM, Srikanth R <[email protected]> wrote:
> hi zookeepers,
>
> I am using ZooKeeper 3.4.5 in a 3-server ensemble, and its dataDir is on
> a dedicated 6-disk 2.5TB RAID10 volume. Only HDFS namenode/journal txns
> and ZooKeeper txnlog/snapshots are written to this volume. The issue is
> that whenever the weekly RAID check runs, clients with 5-second timeouts
> time out at random. Has anyone seen issues like this with the dataDir on
> RAID before?
>
> Also, there aren't many writes going into ZK; only hadoop-ha and the
> HBase master are using the ZK services.
>
> 1. There are no CPU bottlenecks or memory/swapping issues on the boxes.
> 2. In the ZK strace output there are a few random 2-3 sec intervals where
> no system calls are recorded, which is weird, and most of the timeouts
> correspond to these intervals. But I am not able to figure out what ZK is
> doing during them.
> 3. Enabled GC logs; there is no trace of a full GC during the timeouts.
> Full GCs were recorded over time, but those pauses are only 0.3-0.4 secs.
> Also tried the ConcMarkSweep GC without any improvement.
> 4. There are no network errors/timeouts.
> 5. At times I see a max latency of 3-4 secs in the connection stats, while
> avg and min latency are 0 (a sketch for sampling these stats during the
> RAID check follows this list).
> 6. Ran zk-latencies.py; latency seems to be the same with and without the
> RAID check.
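>
> To catch the spike as it happens, one option is to poll the four-letter
> "stat" command on the client port and timestamp the Latency line it
> returns ("srst" resets the counters between samples). A minimal sketch,
> assuming the host and client port from the config below:
>
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.io.OutputStream;
> import java.net.Socket;
>
> public class LatencyPoller {
>     public static void main(String[] args) throws Exception {
>         String host = args.length > 0 ? args[0] : "localhost";
>         while (true) {
>             // Four-letter words go over a plain socket to the client
>             // port; the server replies and closes the connection.
>             Socket s = new Socket(host, 2181);
>             try {
>                 OutputStream out = s.getOutputStream();
>                 out.write("stat".getBytes("US-ASCII"));
>                 out.flush();
>                 BufferedReader in = new BufferedReader(
>                         new InputStreamReader(s.getInputStream(), "US-ASCII"));
>                 String line;
>                 while ((line = in.readLine()) != null) {
>                     if (line.startsWith("Latency")) {
>                         System.out.println(System.currentTimeMillis() + " " + line);
>                     }
>                 }
>             } finally {
>                 s.close();
>             }
>             Thread.sleep(5000);  // sample every 5 seconds
>         }
>     }
> }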
>
> Here's the ZooKeeper config:
>
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/data/zookeeper
> clientPort=2181
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> server.1=xyz1:2888:3888
> server.2=xyz2:2888:3888
> server.3=xyz3:2888:3888
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> jaasLoginRenew=3600000
> kerberos.removeHostFromPrincipal=true
>
> Partition:
>
> -bash-4.1$ df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/md2              116G   79G   32G  72% /
> tmpfs                  12G     0   12G   0% /dev/shm
> /dev/md0               97M   31M   61M  34% /boot
> /dev/md3              2.6T  297M  2.5T   1% /data
>
> -bash-4.1$ cat /proc/mdstat
> Personalities : [raid10] [raid1]
> md3 : active raid10 sdc5[2] sdd5[3] sda5[0] sdf5[5] sdb5[1] sde5[4]
> 2782511616 blocks super 1.1 512K chunks 2 near-copies [6/6] [UUUUUU]
> [===================>.] check = 95.3% (2654099584/2782511616)
> finish=41.5min speed=51516K/sec
> bitmap: 0/21 pages [0KB], 65536KB chunk
>
> Here are my queries:
> 1. What is the best way to find out what the ZooKeeper threads are doing?
> (strace hasn't helped much; see the thread-dump sketch after this list.)
> 2. There isn't much data written to or read from ZK, so why would ZK fail?
> 3. Is it possible to trace all the requests that come in to ZK?
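>
> On query 1, a JVM thread dump is usually more revealing than strace:
> "jstack <pid>" works with no setup, and the same view is available over
> JMX if remote JMX is enabled. A minimal sketch; the host and port 9999
> are assumptions and require starting the ZK JVM with the
> com.sun.management.jmxremote.port properties:
>
> import java.lang.management.ManagementFactory;
> import java.lang.management.ThreadInfo;
> import java.lang.management.ThreadMXBean;
> import javax.management.MBeanServerConnection;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
>
> public class ZkThreadDump {
>     public static void main(String[] args) throws Exception {
>         // Assumed endpoint; needs -Dcom.sun.management.jmxremote.port=9999
>         // (plus auth/ssl settings) on the ZooKeeper server JVM.
>         JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://xyz1:9999/jmxrmi");
>         JMXConnector jmxc = JMXConnectorFactory.connect(url);
>         try {
>             MBeanServerConnection conn = jmxc.getMBeanServerConnection();
>             ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
>                     conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
>             // Dump every live thread with lock information.
>             for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
>                 System.out.print(info);  // toString() truncates deep stacks
>             }
>         } finally {
>             jmxc.close();
>         }
>     }
> }
>
> On query 3, the 3.4 admin guide documents a traceFile setting (Java
> system property requestTraceFile) that logs requests to a trace file,
> at some cost in performance.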
>
> Please let me know if you need more info. Any help is greatly appreciated.
>
> Thanks.
> Srikanth