Hi, I am experiencing some performance issues doing puts with the WAL enabled. Without the WAL, everything runs fine.
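
For reference, a stripped-down sketch of how each reducer writes its puts (table, family and qualifier names below are placeholders, not our real schema); the "without WAL" runs are identical except that setWriteToWAL(false) is called on every Put:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "index_table");     // placeholder table name
    table.setAutoFlush(false);                           // buffer puts on the client side
    table.setWriteBufferSize(16 * 1024 * 1024);          // 16MB client write buffer, as configured

    byte[] row = Bytes.toBytes("some-row-key");          // placeholder; comes from the read phase
    byte[] value = Bytes.toBytes("some-value");          // placeholder value
    Put put = new Put(row);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), value);
    // put.setWriteToWAL(false);                         // only enabled for the "without WAL" runs
    table.put(put);
    // ... many more puts per read ...
    table.flushCommits();
    table.close();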
My workload does roughly 30 million reads (rows), and after each read it does a number of puts (updating multiple indexes, basically). The total is about 165 million puts. The work is done from an MR job that uses 15 reducers against 8 RS. The reads and writes are spread across 16 tables, but I hit only a small number of regions (out of about 1000 in total for all tables). Without WAL, the job takes about 30 to 45 minutes. With WAL, if it runs to completion at all, it takes close to 4 hours (3h45m). Can the difference really be that large? On the master UI, HBase shows between 10K and 50K requests per second with frequent drops to almost zero for some time, while without WAL the same job easily sustains over 100K. Any hint on where to look is greatly appreciated. Below is a description of what happens.

Also, on some runs region servers die because they fail to report for more than 1 minute (YouAreDeadException). I could set the timeout longer, but I think it should work for this setup. The GC log does not show any obvious long pause. Is it possible for flushing / log appending to block the ZK client / heartbeat? In the log snippet below you see a pause of about 1m30s between two log lines before the RS starts to shut down.

2010-12-15 02:38:55,457 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://m1r1.inrdb.ripe.net:9000/hbase/inrdb_ris_update_rrc12/fe902bb3224a1522b0be94d8459f7217/meta/3270107182044894418, entries=10653, sequenceid=367327898, memsize=2.4m, filesize=102.7k
2010-12-15 02:40:11,959 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 70697ms for sessionid 0x12ce510319b0004, closing socket connection and attempting reconnect
2010-12-15 02:40:12,125 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 66270ms for sessionid 0x12ce510319b0005, closing socket connection and attempting reconnect
2010-12-15 02:40:12,174 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x12ce510319b0004 Received Disconnected from ZooKeeper, ignoring
2010-12-15 02:40:12,276 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12ce510319b0005 Received Disconnected from ZooKeeper, ignoring
2010-12-15 02:40:12,623 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server m1r1.inrdb.ripe.net/2001:610:240:1:0:0:c100:1733:2181
2010-12-15 02:40:12,802 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=w2r1.inrdb.ripe.net,60020,1292333234919, load=(requests=10822, regions=575, usedHeap=6806, maxHeap=16000): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing w2r1.inrdb.ripe.net,60020,1292333234919 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing w2r1.inrdb.ripe.net,60020,1292333234919 as dead server

On some other runs the blocking occurs on the client side, which causes the reducers to get killed after ten minutes of not reporting progress. Again, the GC log does not show any long pauses. I guess the RS dying and the clients blocking are just side effects of HBase not being able to cope with the load.
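If I were to raise the timeout, I assume it would be something along these lines in hbase-site.xml on the region servers (and the ZK server's maxSessionTimeout would have to allow a value that high), but as said, I would expect the current setup to hold without this:

    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value>  <!-- milliseconds -->
    </property>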
Versions and setup:

- Hadoop CDH3b3
- HBase 0.90 RC1
- 1 master, running: NN, HM, JT, ZK
- 8 workers, running: DN, RS, TT
- RS gets a 16GB heap
- HBase max filesize = 1GB, client side write buffer = 16MB, memstore flush size = 128MB

CPU usage is not off the chart and no swapping is happening.

Friso
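
P.S. For completeness, this is roughly how those settings map onto hbase-site.xml (property names from memory, values in bytes):

    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value>   <!-- 1GB -->
    </property>
    <property>
      <name>hbase.client.write.buffer</name>
      <value>16777216</value>     <!-- 16MB -->
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>134217728</value>    <!-- 128MB -->
    </property>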
