Here's my take on the issue.

> I monitored the process and when any node fails, it has not used all the
> heaps yet. So it is not a heap space problem.
I disagree. Unless you load a region server with more data than there is
heap available (loading batches of humongous rows, for example), it will
not fill it. That doesn't mean you have enough heap, because HBase takes
precautions in order to not run out of memory. In your case, you have a
lot of block cache thrashing:

2011-12-01 17:05:49,084 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.68 MB of total=677.18 MB
2011-12-01 17:05:49,087 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.72 MB, total=597.78 MB, single=372.13 MB, multi=298.71 MB, memory=0 KB
2011-12-01 17:05:50,069 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 79.67 MB of total=677.17 MB
2011-12-01 17:05:50,084 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=79.67 MB, total=597.75 MB, single=372.05 MB, multi=298.71 MB, memory=0 KB
etc.

This is the kind of precaution I'm talking about. BTW, in MR jobs you
should always disable the block cache, as shown in this example:
http://hbase.apache.org/book/mapreduce.example.html#mapreduce.example.read

scan.setCacheBlocks(false);  // don't set to true for MR jobs

I don't know if this is related to your current job; it's not clear from
your description of the job whether the mapping is done on HBase.

> And finally, according to the logs I pasted, I see other lines with DEBUG
> or INFO. So I thought this was okay.
> Is there a way to change WARN level log to some other level log? If you'd
> let me know, I will paste another set of logs.

The connection reset stuff is interesting, and this warning indeed points
out that something's weird. It would be interesting to see some task logs
(not the TaskTracker's, nor the JobTracker's; those are usually of little
use while debugging this type of problem).
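For context, a minimal sketch of a read job set up along the lines of that
book example (the table name and MyMapper class here are placeholders, not
anything from your job):

```java
// Sketch of an HBase-as-source MapReduce job setup, per the book example above.
// "myTable" and MyMapper are hypothetical; substitute your own.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);

Scan scan = new Scan();
scan.setCaching(500);        // default of 1 is bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs

TableMapReduceUtil.initTableMapperJob(
    "myTable",       // input table
    scan,            // Scan instance with caching/cache-blocks set
    MyMapper.class,  // mapper class
    null,            // mapper output key (null if mapper emits nothing)
    null,            // mapper output value
    job);
```

With setCacheBlocks(false), the full-table scan the mapper drives won't
churn the region servers' LRU block cache the way the eviction log lines
above show.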
In any case, what it means is that the client (the map or reduce task, or
even some other client you have) reset the connection, so the region
server just drops it.

> The regionserver that contains that specific region fails. That is the
> point. If I move that region to another regionserver using hbase shell,
> then that regionserver fails.
> With the same log output.

You haven't shown us the log output of a dying region server yet.
Actually, from those logs I don't even see a lot of importing going on,
just a lot of reading. Look for ERROR level logging, then grab everything
that's around it and post it here please (go up in the log to the point
where it looks like normal logging; usually the ERROR will get logged
after some important lines). It would also be interesting to see the full
reducer task log.

J-D

On Thu, Dec 1, 2011 at 12:48 AM, edward choi <[email protected]> wrote:
> Hi,
> I've had a problem that has been killing me for some days now.
> I am using the CDH3 update 2 version of Hadoop and HBase.
> When I do a large amount of bulk loading into HBase, some node always
> dies. It's not just one particular node.
> But one of many nodes fails to serve eventually.
>
> I set 4 gigs of heap space for the master, and regionservers. I monitored
> the process and when any node fails, it has not used all the heaps yet.
> So it is not a heap space problem.
>
> Below is what I get when I perform bulk put using MapReduce.
>
