Hey Friso,

A few thoughts:

 - Using the WAL will degrade your performance a lot, and that is
expected. Remember, each append means sending a chunk of data through
a pipeline to the memory of 3 datanodes (compared to just writing to
the region server's own memory when the WAL is off). I'm not sure I
quite understand your surprise there. (There's a sketch of toggling
the WAL per Put after this list.)

 - Regarding the YouAreDeadException, the fact that it prints
"have not heard from server in 70697ms", and that nothing was written
to the log between 02:38:55 and 02:40:11 just before that, strongly
indicates GC activity in that JVM. I'd like to see the GC log before
ruling that out.

 - Regarding the clients that died, did that happen at the same time
the region server died?

 - Finally, doing massive imports requires finer tuning of your
cluster compared to one that serves "normal" traffic. Have you
considered using the bulk loading tools instead? (There's a rough
sketch of that path below as well.)
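
For reference, here's a minimal sketch of what skipping the WAL for a
single Put looks like with the 0.90 client API. The table and column
names are made up for illustration, not taken from your setup:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalToggleSketch {
      public static void main(String[] args) throws Exception {
        // "my_index", "cf" and "qual" are hypothetical names.
        HTable table = new HTable(HBaseConfiguration.create(), "my_index");
        Put put = new Put(Bytes.toBytes("row-key"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"),
            Bytes.toBytes("value"));
        // Skip the WAL for this edit only: faster, but the edit is lost
        // if the region server dies before the memstore is flushed.
        put.setWriteToWAL(false);
        table.put(put);
        table.close();
      }
    }

That durability-for-speed trade-off is where the difference between
your two runs comes from.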
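
And a rough sketch of the incremental bulk load path, in case you want
to try it. The input format, table name, column names and paths here
are assumptions; your real job would build KeyValues from your own
data:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadSketch {

      // Hypothetical mapper: one tab-separated "rowkey<TAB>value" line
      // in, one KeyValue out.
      static class LineMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws java.io.IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          byte[] row = Bytes.toBytes(parts[0]);
          KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
              Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
          ctx.write(new ImmutableBytesWritable(row), kv);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_index"); // assumed table name

        Job job = new Job(conf, "bulk load sketch");
        job.setJarByClass(BulkLoadSketch.class);
        job.setMapperClass(LineMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path out = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, out);

        // Sets up total-order partitioning over the table's current
        // region boundaries and the KeyValue sorting reducer.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (job.waitForCompletion(true)) {
          // Move the generated HFiles into the regions; same effect as
          // running the 'completebulkload' tool on the output dir.
          new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
        }
      }
    }

This writes HFiles directly instead of pushing every edit through the
memstore and the WAL, so the region servers mostly stay out of the
write path.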

J-D

On Wed, Dec 15, 2010 at 6:44 AM, Friso van Vollenhoven
<[email protected]> wrote:
> Hi,
>
> I am experiencing some performance issues doing puts with WAL enabled. 
> Without, everything runs fine.
>
> My workload does roughly 30 million reads (rows) and after each read does a 
> number of puts (updating multiple indexes, basically). Total is about 165 
> million puts. The work is done from an MR job that uses 15 reducers against 8 
> RS. The reads and writes are across 16 tables, but I hit only a small number 
> of regions (out of about 1000 in total for all tables). Without WAL, the job 
> takes about 30 to 45 minutes. With WAL, if it runs to completion, it takes 
> close to 4 hours (3h45m). Can the difference be that large? On the master UI 
> HBase shows doing between 10K and 50K requests per second with quite some 
> drops to almost zero for some amount of time, while without WAL for the same 
> job it easily reaches over 100K sustained.
>
> Any hint on where to look is greatly appreciated. Below is a description of 
> what happens.
>
> Also, on some runs, region servers die because they fail to report for more 
> than 1 minute (YouAreDeadException). I could set the timeout longer, but I 
> think it should work for this setup. The GC log does not show any obvious 
> long pause. Is it possible for flushing / log appending to block the ZK 
> client / heartbeat? In the log snippet below you see a pause of about 1m30s 
> between two log lines before the RS starts to shut down.
>
> 2010-12-15 02:38:55,457 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://m1r1.inrdb.ripe.net:9000/hbase/inrdb_ris_update_rrc12/fe902bb3224a1522b0be94d8459f7217/meta/3270107182044894418, entries=10653, sequenceid=367327898, memsize=2.4m, filesize=102.7k
> 2010-12-15 02:40:11,959 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 70697ms for sessionid 0x12ce510319b0004, closing socket connection and attempting reconnect
> 2010-12-15 02:40:12,125 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 66270ms for sessionid 0x12ce510319b0005, closing socket connection and attempting reconnect
> 2010-12-15 02:40:12,174 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x12ce510319b0004 Received Disconnected from ZooKeeper, ignoring
> 2010-12-15 02:40:12,276 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12ce510319b0005 Received Disconnected from ZooKeeper, ignoring
> 2010-12-15 02:40:12,623 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server m1r1.inrdb.ripe.net/2001:610:240:1:0:0:c100:1733:2181
> 2010-12-15 02:40:12,802 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=w2r1.inrdb.ripe.net,60020,1292333234919, load=(requests=10822, regions=575, usedHeap=6806, maxHeap=16000): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing w2r1.inrdb.ripe.net,60020,1292333234919 as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing w2r1.inrdb.ripe.net,60020,1292333234919 as dead server
>
> On some other runs there occurs blocking on the client side. This causes the 
> reducers to get killed after ten minutes of not reporting. Still, the GC log 
> does not show any long pauses.
>
> I guess that the RS dying and the clients blocking are just side effects of 
> HBase not being able to cope with the load.
>
> Versions and setup:
> Hadoop CDH3b3
> HBase 0.90 rc1
> 1 master, running: NN, HM, JT, ZK
> 8 workers, running: DN, RS, TT
> RS gets 16GB heap
> HBase max filesize = 1GB, client side write buffer = 16MB, memstore flush 
> size is at 128MB
> CPU usage is not off the chart and no swapping is happening.
>
>
>
> Friso
>
>
