Hi,
I've been fighting a problem that has been killing me for some days now.
I am using the CDH3 update 2 versions of Hadoop and HBase.
When I do a large amount of bulk loading into HBase, some node always dies.
It is not one particular node; eventually one of the many nodes stops serving.
I gave 4 GB of heap to the master and to each region server. I monitored the
processes, and whenever a node fails it has not yet used all of its heap,
so it does not look like a heap space problem.
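For reference, the heap is set in hbase-env.sh roughly like this (the 4000 value is what I recall setting; HBASE_HEAPSIZE is in megabytes):

```sh
# hbase-env.sh on the master and on each region server.
# HBASE_HEAPSIZE is in MB, so 4000 is roughly the 4 GB mentioned above.
export HBASE_HEAPSIZE=4000
```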
Below is what I get when I perform bulk puts using MapReduce.
--------------------------------------------------------------------------------
11/12/01 17:17:20 INFO mapred.JobClient: map 100% reduce 100%
11/12/01 17:18:31 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_0, Status : FAILED
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1239)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1253)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:828)
        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:684)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:669)
        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:127)
        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:82)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at etri.qa.mapreduce.PostProcess$PostPro
attempt_201111302113_0034_r_000013_0: 20111122
11/12/01 17:18:36 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_1, Status : FAILED
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
        (identical stack trace to the first attempt, snipped)
attempt_201111302113_0034_r_000013_1: 20111122
11/12/01 17:18:37 INFO mapred.JobClient: map 100% reduce 95%
11/12/01 17:18:44 INFO mapred.JobClient: map 100% reduce 96%
11/12/01 17:18:47 INFO mapred.JobClient: map 100% reduce 98%
11/12/01 17:18:50 INFO mapred.JobClient: map 100% reduce 99%
11/12/01 17:18:53 INFO mapred.JobClient: map 100% reduce 100%
11/12/01 17:20:07 INFO mapred.JobClient: Task Id : attempt_201111302113_0034_r_000013_3, Status : FAILED
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: servers with issues: lp171.etri.re.kr:60020,
        (identical stack trace to the first attempt, snipped)
attempt_201111302113_0034_r_000013_3: 20111122
11/12/01 17:20:09 INFO mapred.JobClient: map 100% reduce 95%
11/12/01 17:20:09 INFO mapred.JobClient: Job complete: job_201111302113_0034
11/12/01 17:20:09 INFO mapred.JobClient: Counters: 23
11/12/01 17:20:09 INFO mapred.JobClient: Job Counters
11/12/01 17:20:09 INFO mapred.JobClient: Launched reduce tasks=34
11/12/01 17:20:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4701750
11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/01 17:20:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/12/01 17:20:09 INFO mapred.JobClient: Rack-local map tasks=54
11/12/01 17:20:09 INFO mapred.JobClient: Launched map tasks=310
11/12/01 17:20:09 INFO mapred.JobClient: Data-local map tasks=256
11/12/01 17:20:09 INFO mapred.JobClient: Failed reduce tasks=1
11/12/01 17:20:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=3043956
11/12/01 17:20:09 INFO mapred.JobClient: FileSystemCounters
11/12/01 17:20:09 INFO mapred.JobClient: FILE_BYTES_READ=1120
11/12/01 17:20:09 INFO mapred.JobClient: HDFS_BYTES_READ=36392
11/12/01 17:20:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=69076396
11/12/01 17:20:09 INFO mapred.JobClient: Map-Reduce Framework
11/12/01 17:20:09 INFO mapred.JobClient: Reduce input groups=0
11/12/01 17:20:09 INFO mapred.JobClient: Combine output records=0
11/12/01 17:20:09 INFO mapred.JobClient: Map input records=2187386
11/12/01 17:20:09 INFO mapred.JobClient: Reduce shuffle bytes=289536
11/12/01 17:20:09 INFO mapred.JobClient: Reduce output records=0
11/12/01 17:20:09 INFO mapred.JobClient: Spilled Records=2187386
11/12/01 17:20:09 INFO mapred.JobClient: Map output bytes=91870212
11/12/01 17:20:09 INFO mapred.JobClient: Combine input records=0
11/12/01 17:20:09 INFO mapred.JobClient: Map output records=2187386
11/12/01 17:20:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=36392
11/12/01 17:20:09 INFO mapred.JobClient: Reduce input records=0
--------------------------------------------------------------------------------
So I looked at the log of the region server that failed.
Below is what it shows around the time of the failure.
--------------------------------------------------------------------------------
2011-12-01 17:07:37,968 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.73 MB, total=597.89 MB, single=372.17 MB, multi=298.71 MB, memory=0 KB
2011-12-01 17:07:39,721 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.7 MB of total=677.2 MB
2011-12-01 17:07:39,737 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.71 MB, total=598.06 MB, single=372.2 MB, multi=298.71 MB, memory=0 KB
2011-12-01 17:07:41,414 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.71 MB of total=677.21 MB
2011-12-01 17:07:41,418 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.72 MB, total=597.82 MB, single=372.16 MB, multi=298.71 MB, memory=0 KB
2011-12-01 17:07:41,849 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-12-01 17:10:41,509 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 8383367466793958870 lease expired
2011-12-01 17:11:06,816 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=601.74 MB, free=194.94 MB, max=796.67 MB, blocks=9281, accesses=3415277, hits=106627, hitRatio=3.12%, cachingAccesses=2734012, cachingHits=88768, cachingHitsRatio=3.24%, evictions=2227, evicted=2635963, evictedPerRun=1183.6385498046875
2011-12-01 17:13:33,457 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.72 MB of total=677.22 MB
2011-12-01 17:13:33,461 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.75 MB, total=598.11 MB, single=372.35 MB, multi=298.71 MB, memory=0 KB
2011-12-01 17:13:34,169 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.71 MB of total=677.22 MB
--------------------------------------------------------------------------------
I searched Google, and the most prominent answer I found was that the allowed
number of ZooKeeper client connections is too small. But since I am using
HBase 0.90.4, where that default was already raised, I thought that was not my
case.
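For completeness, my understanding is that this limit is controlled by hbase.zookeeper.property.maxClientCnxns in hbase-site.xml. I have not changed it from the default; the value below is only an illustration of how it would be raised:

```xml
<!-- hbase-site.xml: caps ZooKeeper connections per client host.
     The value 1000 here is purely illustrative, not my actual setting. -->
<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>1000</value>
</property>
```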
Could anyone please shed some light on this problem?
Regards,
Ed