Re: Debugging High I/O Wait

2019-04-09 Thread ramkrishna vasudevan
Hi Srinidhi,

I am not able to view the attachments for some reason. However, as Anoop
suggested, can you try multiple paths for the bucket cache? As said in the
first email, a separate SSD for the WAL writes plus more than one file path
for the bucket cache SSD may help. The multiple paths can also be spread
across multiple devices; see the sketch below.
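
To make that concrete, here is a minimal hbase-site.xml sketch. It assumes a
release that supports the multi-file bucket cache ioengine (the "files:"
prefix); the mount points /mnt/ssd1 and /mnt/ssd2 and the size are only
illustrative, so substitute your own paths and capacity.

<!-- Sketch only: bucket cache spread over two file paths, ideally on
     different SSD devices. Paths and size are illustrative. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>files:/mnt/ssd1/bucketcache.data,/mnt/ssd2/bucketcache.data</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>40960</value> <!-- total capacity in MB (assumed split across the paths) -->
</property>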

Regards
Ram

On Fri, Apr 5, 2019 at 2:58 AM Srinidhi Muppalla wrote:

> After some more digging, I discovered that while the RS is stuck, the
> kernel message buffer contains only this message:
>
> "[1031214.108110] XFS: java(6522) possible memory allocation deadlock size
> 32944 in kmem_alloc (mode:0x2400240)"
>
> From my reading online, this error generally points to excessive memory
> and file fragmentation. We haven't changed the MSLAB config, and since we
> are running HBase 1.3.0, MSLAB should be enabled by default. The issue
> arises consistently and regularly (every 10 days or so), and once one node
> is affected, other nodes start to follow within a few hours. What could be
> causing this, and is there any way to prevent or minimize fragmentation?
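>
> For reference, a minimal sketch of the MSLAB/chunk-pool keys in
> hbase-site.xml that I understand are in play here (values are
> illustrative, not what we actually run):
>
> <!-- Sketch only: MSLAB is on by default in 1.3.0; the chunk pool value
>      below is an illustrative non-default, not a recommendation. -->
> <property>
>   <name>hbase.hregion.memstore.mslab.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hbase.hregion.memstore.chunkpool.maxsize</name>
>   <value>0.2</value> <!-- fraction of global memstore size kept as pooled chunks -->
> </property>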
>
> Best,
> Srinidhi
>
> On 3/29/19, 11:02 AM, "Srinidhi Muppalla"  wrote:
>
> Stack and Ram,
>
> Attached the thread dumps. 'Jstack normal' is from the normal node;
> 'Jstack problematic' was taken while the node was stuck.
>
> We don't have full I/O stats for the problematic node. Unfortunately, it
> was impacting production, so we had to recreate the cluster as soon as
> possible and couldn't get full data. I attached the dashboards with the
> I/O wait and other CPU stats. Thanks for helping look into the issue!
>
> Best,
> Srinidhi
>
>
>
> On 3/28/19, 2:41 PM, "Stack"  wrote:
>
> Mind putting up a thread dump?
>
> How many spindles?
>
> If you compare the I/O stats between a good RS and a stuck one, how do
> they compare?
>
> Thanks,
> S
>
>
> On Wed, Mar 27, 2019 at 11:57 AM Srinidhi Muppalla <srinid...@trulia.com> wrote:
>
> > Hello,
> >
> > We've noticed an issue in our HBase cluster where one of the
> > region-servers has a spike in I/O wait associated with a spike in Load
> > for that node. As a result, our request times to the cluster increase
> > dramatically. Initially, we suspected that we were experiencing
> > hotspotting, but even after temporarily blocking requests to the
> > highest-volume regions on that region-server, the issue persisted.
> > Moreover, when looking at request counts to the regions on the
> > region-server from the HBase UI, they were not particularly high, and
> > our own application-level metrics on the requests we were making were
> > not very high either. From looking at a thread dump of the
> > region-server, it appears that our get and scan requests are getting
> > stuck when trying to read from the blocks in our bucket cache, leaving
> > the threads in a 'runnable' state. For context, we are running HBase
> > 1.3.0 on a cluster backed by S3 running on EMR, and our bucket cache is
> > running in File mode. Our region-servers all have SSDs. We have a
> > combined cache with the L1 standard LRU cache and the L2 file-mode
> > bucket cache. Our bucket cache utilization is less than 50% of the
> > allocated space.
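> >
> > For reference, the combined-cache setup above corresponds roughly to
> > hbase-site.xml settings like the following (a sketch only; the path and
> > sizes are illustrative, not our actual values):
> >
> > <!-- Sketch only: L1 on-heap LRU cache plus L2 file-mode bucket cache. -->
> > <property>
> >   <name>hfile.block.cache.size</name>
> >   <value>0.3</value> <!-- L1 LRU cache, as a fraction of heap -->
> > </property>
> > <property>
> >   <name>hbase.bucketcache.ioengine</name>
> >   <value>file:/mnt/ssd1/bucketcache.data</value> <!-- L2 on the local SSD -->
> > </property>
> > <property>
> >   <name>hbase.bucketcache.size</name>
> >   <value>40960</value> <!-- L2 capacity in MB -->
> > </property>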
> >
> > We suspect that part of the issue is our disk space utilization on the
> > region-server, as our max disk space utilization also increased as this
> > happened. What things can we do to minimize disk space utilization? The
> > actual HFiles are on S3 -- only the cache, application logs, and
> > write-ahead logs are on the region-servers. Other than disk space
> > utilization, what factors could cause high I/O wait in HBase, and is
> > there anything we can do to minimize it?
> >
> > Right now, the only thing that works is terminating and recreating the
> > cluster (which we can do safely because it's S3 backed).
> >
> > Thanks!
> > Srinidhi
> >


Re: HBase connection refused after random time delays

2019-04-09 Thread melanka . w



On 2019/04/08 23:11:50, Josh Elser  wrote: 
> 
> 
> On 4/7/19 10:44 PM, melank...@synergentl.com wrote:
> > 
> > 
> > On 2019/04/04 15:15:37, Josh Elser  wrote:
> >> Looks like your RegionServer process might have died if you can't
> >> connect to its RPC port.
> >>
> >> Did you look in the RegionServer log for any mention of an ERROR or
> >> FATAL log message?
> >>
> >> On 4/4/19 8:20 AM, melank...@synergentl.com wrote:
> >>> I have successfully installed single-node Hadoop
> >>> (http://intellitech.pro/tutorial-hadoop-first-lab/) and HBase
> >>> (http://intellitech.pro/hbase-installation-on-ubuntu/). I am using a
> >>> Java agent to connect to HBase. After a random time period, HBase stops
> >>> working and the Java agent gives the following error message:
> >>>
> >>> Call exception, tries=7, retries=7, started=8321 ms ago, cancelled=false, 
> >>> msg=Call to db-2.c.xxx-dev.internal/xx.xx.0.21:16201 failed on connection 
> >>> exception: 
> >>> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> >>>  Connection refused: db-2.c.xxx-dev.internal/xx.xx.0.21:16201, 
> >>> details=row 'xxx,001:155390400,99' on table 
> >>> 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> >>> hostname=db-2.c.xxx-dev.internal,16201,1553683263844, seqNum=-1
> >>> Here are the HBase and ZooKeeper logs:
> >>>
> >>> hbase-hduser-regionserver-db-2.log
> >>>
> >>> [main] zookeeper.ZooKeeperMain: Processing delete 2019-03-30 02:11:44,089 
> >>> DEBUG [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Reading 
> >>> reply sessionid:0x169bd98c099006e, packet:: clientPath:null 
> >>> serverPath:null finished:false header:: 1,2 replyHeader:: 1,300964,0 
> >>> request:: 
> >>> '/hbase/rs/db-2.c.stl-cardio-dev.internal%2C16201%2C1553683263844,-1 
> >>> response:: null
> >>> hbase-hduser-zookeeper-db-2.log
> >>>
> >>> server.FinalRequestProcessor: sessionid:0x169bd98c099004a 
> >>> type:getChildren cxid:0x28e3ad zxid:0xfffe txntype:unknown 
> >>> reqpath:/hbase/splitWAL
> >>> My hbase-site.xml file is as follows:
> >>>
> >>> <configuration>
> >>>   <!-- Here you have to set the path where you want HBase to store its
> >>>        files. -->
> >>>   <property>
> >>>     <name>hbase.rootdir</name>
> >>>     <value>hdfs://localhost:9000/hbase</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>hbase.zookeeper.quorum</name>
> >>>     <value>localhost</value>
> >>>   </property>
> >>>   <!-- Here you have to set the path where you want HBase to store its
> >>>        built-in zookeeper files. -->
> >>>   <property>
> >>>     <name>hbase.zookeeper.property.dataDir</name>
> >>>     <value>${hbase.tmp.dir}/zookeeper</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>hbase.cluster.distributed</name>
> >>>     <value>true</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>hbase.zookeeper.property.clientPort</name>
> >>>     <value>2181</value>
> >>>   </property>
> >>> </configuration>
> >>>
> >>> When I restart HBase, it starts working again and then stops working
> >>> after a few days. I am wondering what the fix for this would be.
> >>>
> >>> Thanks.
> >>> BR,
> >>> Melanka
> >>>
> > Hi Josh,
> > Sorry for the late reply. I restarted HBase on 05/04/2019 and it was
> > down again on 06/04/2019 at 00:06 AM.
> >
> > The log from hbase-root-regionserver-db-2 is as follows:
> > 
> > 2019-04-04 04:42:26,047 DEBUG [main-SendThread(localhost:2181)] 
> > zookeeper.ClientCnxn: Reading reply sessionid:0x169d86a879b00bf, packet:: 
> > clientPath:null serverPath:null finished:false header:: 67,2  replyHeader:: 
> > 67,776370,0  request:: 
> > '/hbase/rs/db-2.c.stl-cardio-dev.internal%2C16201%2C1554352093266,-1  
> > response:: null
> > 2019-04-04 04:42:26,047 DEBUG [main-EventThread] 
> > zookeeper.ZooKeeperWatcher: regionserver:16201-0x169d86a879b00bf, 
> > quorum=localhost:2181, baseZNode=/hbase Received ZooKeeper Event, 
> > type=NodeDeleted, state=SyncConnected, 
> > path=/hbase/rs/db-2.c.stl-cardio-dev.internal,16201,1554352093266
> > 2019-04-04 04:42:26,047 DEBUG [main-EventThread] 
> > zookeeper.ZooKeeperWatcher: regionserver:16201-0x169d86a879b00bf, 
> > quorum=localhost:2181, baseZNode=/hbase Received ZooKeeper Event, 
> > type=NodeChildrenChanged, state=SyncConnected, path=/hbase/rs
> > 2019-04-04 04:42:26,050 DEBUG 
> > [regionserver/db-2.c.xxx-dev.internal/xx.xxx.0.21:16201] 
> > zookeeper.ZooKeeper: Closing session: 0x169d86a879b00bf
> > 2019-04-04 04:42:26,050 DEBUG 
> > [regionserver/db-2.c.xxx-dev.internal/xx.xx.0.21:16201] 
> > zookeeper.ClientCnxn: Closing client for session: 0x169d86a879b00bf
> > 2019-04-04 04:42:26,056 DEBUG [main-SendThread(localhost:2181)] 
> > zookeeper.ClientCnxn: Reading reply sessionid:0x169d86a879b00bf, packet:: 
> > clientPath:null serverPath:null finished:false header:: 68,-11  
> > replyHeader:: 68,776371,0  request:: null response:: null
> > 2019-04-04 04:42:26,056 DEBUG 
> > [regionserver/db-2.c.xxx-dev.internal/xx.xxx.0.21:16201] 
> > zookeeper.ClientCnxn: Disconnecting client for session: 0x169d86a879b00bf
> > 2019-04-04 04:42:26,056 INFO  
> > [regionserver/db-2.c.xxx-dev.internal/xxx.xxx.0.21:16201] 
> > zookeeper.ZooKeeper: Session: 0x169d86a879b00bf closed
> > 2019-04-04 04:42:26,056