The test has stopped, so I can't really tell now. What I noticed while running "iostat -x 5" the whole time is that the CPU %idle is often less than 5%, sometimes close to 0%, and the disk %util is higher than 90%.

This indicates that the disk is pretty busy. On the other hand, I was running only 3 mappers inserting records into hbase, and I artificially restricted the insertion rate to 80 records/second. The table has 20 column families, each family with 4-5 columns. By my estimate, each record is about 10K bytes, so the insertion rate is not really that high.

The table columns are compressed. I do notice that when I disable the compression, the insertion rate is even slower. This suggests that disk usage probably is indeed the bottleneck, but it is hard to believe that I can't even insert 3x80 = 240 records/second with 10K-byte records. That load isn't heavy at all. Right now I have reduced the rate even lower, to 20 records/second, to see if it will be in better shape.
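The per-mapper throttle described above (80 records/second) can be sketched as a simple sleep-based rate limiter. This is a hypothetical Python illustration of the idea, not the actual mapper code from this thread:

```python
import time

class RateLimiter:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate):
        self.interval = 1.0 / rate      # minimum seconds between calls
        self.next_slot = time.monotonic()

    def acquire(self):
        """Block until the next call is allowed."""
        now = time.monotonic()
        if now < self.next_slot:
            time.sleep(self.next_slot - now)
            now = self.next_slot
        self.next_slot = now + self.interval

# Each mapper would call acquire() before every put to hbase:
limiter = RateLimiter(80)   # 80 records/second, as in the test described above
```

With ~10K-byte records this caps each mapper at roughly 800 KB/s of logical write volume, which is why saturating the disk at this rate is surprising.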

Did you check the datanode log? There are lots of errors there. Can you tell why that happens?


Jimmy.

--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Wednesday, July 14, 2010 12:10 PM
To: <[email protected]>
Subject: Re: regionserver crash under heavy load

Thanks for the zookeeper log, I was able to pinpoint the problem further. So I see this (it seems I didn't dig deep enough in the logs; I wasn't expecting a 3-minute difference):

2010-07-13 21:49:17,864 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2010-07-13 22:03:46,313 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=3.3544312MB (3517376), Free=405.42056MB (425114272), Max=408.775MB (428631648), Counts: Blocks=0, Access=6580365, Hit=0, Miss=6580365, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%, Evicted/Run=NaN
2010-07-13 22:03:46,483 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-07-13 22:03:46,404 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 908177ms for sessionid 0x229c87a7f4e0011, closing socket connection and attempting reconnect
2010-07-13 22:03:46,400 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-07-13 22:03:46,313 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 2 on 60020 took 868374ms appending an edit to hlog; editcount=4274

At the same time, we can see that the zookeeper log does expire the session:

2010-07-13 21:52:00,001 - INFO  [SessionTracker:zookeeperser...@315] - Expiring session 0x229c87a7f4e0011, timeout of 120000ms exceeded
2010-07-13 21:52:00,001 - INFO  [SessionTracker:zookeeperser...@315] - Expiring session 0x129c87a7f980015, timeout of 120000ms exceeded
2010-07-13 21:52:00,002 - INFO  [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x229c87a7f4e0011
2010-07-13 21:52:00,002 - INFO  [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x129c87a7f980015

So your region server process was paused for more than 10 minutes; the logs show that gap, and the lines that follow talk about that big pause. Are your nodes swapping?
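For reference, the "timeout of 120000ms exceeded" lines above correspond to the client's ZooKeeper session timeout, which can be raised in hbase-site.xml if long pauses cannot be avoided. This is only a sketch; the value below is illustrative, not a recommendation from this thread, and the quorum's own min/max session limits also apply:

```xml
<!-- hbase-site.xml: illustrative value only -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value>  <!-- 5 minutes -->
</property>
```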

J-D

On Wed, Jul 14, 2010 at 11:48 AM, Jinsong Hu <[email protected]> wrote:
The zookeepers are on 3 separate physical machines, colocated with the 3 masters.
I have put their logs here:
http://t-collectors1.proofpoint.com:8079/zookeeper.tar.gz
The version is also from the CDH3 June 10 release.

I checked the datanode logs in that period; there are lots of exceptions. I have put them here:
http://t-collectors1.proofpoint.com:8079/backup.tar.gz

Jinsong

--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Wednesday, July 14, 2010 11:16 AM
To: <[email protected]>
Subject: Re: regionserver crash under heavy load

So your region servers had their sessions expired, but I don't see any sign of GC activity. Usually that points to a case where zookeeper isn't able to answer requests fast enough because it is IO-starved. Are the zookeeper quorum members on the same nodes as the region servers?

J-D

On Wed, Jul 14, 2010 at 10:16 AM, Jinsong Hu <[email protected]> wrote:

I have uploaded the crashed log files to

http://somewhere
http://somewhere

It includes the GC log and config files too. Both of these regionservers crashed when I used only 3 mappers to insert records into hbase, and in each task I limited the data rate to 80 records/second.

The machines I use are relatively old, with only 4G of RAM, a 4-core CPU, and a 250G disk. I run the tasktracker, datanode and regionserver on them. I have 3 machines, and only 1 regionserver was still up after continuous insertion overnight. Before the test I created the hbase tables, so I started with an empty table. I saw it running fine for several hours, and when I checked again this morning, 2 regionservers had died. I backed up the logs and restarted the regionservers, and then I found the regionserver process is up but not listening on any port.

The log doesn't show any error:
2010-07-14 16:23:45,769 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
2010-07-14 16:23:45,951 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got 10.110.24.48:60000
2010-07-14 16:23:45,957 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.110.24.48:60000 that we are up
2010-07-14 16:23:46,283 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020


I am using the June 10 release of the Cloudera distribution for both hadoop and hbase.

Jimmy



--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Tuesday, July 13, 2010 4:24 PM
To: <[email protected]>
Subject: Re: regionserver crash under heavy load

Your region server doesn't look very loaded from the metrics POV. But I specifically asked for the lines around that, not just the dump, since they will contain the reason for the shutdown.

I do notice that the disk usage is pretty high. I am just thinking that our problem probably is a hardware limit, but the server should not crash when the hardware limit is reached.

We still don't know why it crashed, and it may not even be related to HW limits; we need those bigger log traces. Also use pastebin.com or anything like that.


Do you have any idea when the CDH3 official release will be out?

I don't work for Cloudera, but IIRC the next beta for CDH3 is due in September.


Jimmy

--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Tuesday, July 13, 2010 2:55 PM
To: <[email protected]>
Subject: Re: regionserver crash under heavy load

Please use a pasting service for the log traces. I personally use
pastebin.com

You probably had a GC pause that lasted too long; this is something out of the control of the application (apart from trying to put as little data in memory as possible, but you are inserting, so...). Your log doesn't contain enough information for us to tell; please look for a "Dump of metrics" line and paste the lines around it.
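As background on the GC pauses mentioned above: a common mitigation on HBase deployments of this era was enabling the CMS collector and GC logging via hbase-env.sh. The flags below are an illustrative sketch only; the occupancy threshold and log path are assumptions, not taken from this thread:

```shell
# hbase-env.sh -- illustrative GC settings; values and log path are assumptions
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/hbase/gc.log"
```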

J-D

On Tue, Jul 13, 2010 at 2:49 PM, Jinsong Hu <[email protected]> wrote:

Hi, Todd:
I downloaded hadoop-0.20.2+320 and hbase-0.89.20100621+17 from CDH3 and inserted data at full load; after a while the hbase regionserver crashed. I checked the system with "iostat -x 5" and noticed the disk is pretty busy. Then I modified my client code and reduced the insertion rate by 6 times, and the test runs fine. Is there any way the regionserver could be modified so that at least it doesn't crash under heavy load? I used the apache hbase 0.20.5 distribution and the same problem happened. I am thinking that when the regionserver is too busy, it should throttle the incoming data rate to protect the server. Could this be done?
Do you also know when the CDH3 official release will come out? The one I downloaded is a beta version.

Jimmy






2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Spam_MsgEventTable,56-2010-05-19 10:09:02\x099a420f4f31748828fd24aeea1d06b294,1278973678315.01dd22f517dabf53ddd135709b68ba6c.
2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: m0002029.ppops.net,60020,1278969481450
2010-07-13 02:24:34,389 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper; /hbase/root-region-server
2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
2010-07-13 02:24:34,608 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/m0002029.ppops.net,60020,1278969481450/10.110.24.79%3A60020.1278987220794 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
   at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
   at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
   at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:124)
   at org.apache.hadoop.hbase.regionserver.wal.HLog.hflush(HLog.java:826)
   at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1004)
   at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:817)
   at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1531)
   at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
   at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
   at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
   at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
2010-07-13 02:24:34,610 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/Spam_MsgEventTable/079c7de876422e57e5f09fef5d997e06/.tmp/6773658134549268273 : java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2603)
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
2010-07-13 02:24:34,729 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.







