Here's my take on the issue.

> I monitored the
> process and when any node fails, it has not used all the heaps yet.
> So it is not a heap space problem.

I disagree. Unless you push more data into a region server's heap than
the heap can hold (loading batches of humongous rows, for example), it
will never fill up. That doesn't mean you have enough heap; HBase takes
precautions to avoid running out of memory. In your case, you have a
lot of block cache thrashing:

2011-12-01 17:05:49,084 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU
eviction started; Attempting to free 79.68 MB of total=677.18 MB
2011-12-01 17:05:49,087 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU
eviction completed; freed=79.72 MB, total=597.78 MB, single=372.13 MB,
multi=298.71 MB, memory=0 KB
2011-12-01 17:05:50,069 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU
eviction started; Attempting to free 79.67 MB of total=677.17 MB
2011-12-01 17:05:50,084 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU
eviction completed; freed=79.67 MB, total=597.75 MB, single=372.05 MB,
multi=298.71 MB, memory=0 KB
etc

This is the kind of precaution I'm talking about. BTW, in MR jobs you
should always disable the block cache, as shown in this example:
http://hbase.apache.org/book/mapreduce.example.html#mapreduce.example.read

scan.setCacheBlocks(false);  // don't set to true for MR jobs
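Concretely, a read-side job setup along the lines of that book example
might look like the sketch below. The table name and mapper class are
placeholders, not something from your job:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: "mytable" and MyMapper stand in for your own table/mapper.
Scan scan = new Scan();
scan.setCaching(500);        // larger scanner caching helps MR throughput
scan.setCacheBlocks(false);  // don't churn the region server block cache
TableMapReduceUtil.initTableMapperJob(
    "mytable",       // input table (placeholder)
    scan,            // the Scan configured above
    MyMapper.class,  // your mapper (placeholder)
    null,            // mapper output key class (none in this sketch)
    null,            // mapper output value class (none in this sketch)
    job);
```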

I don't know if this is related to your current job; it's not clear
from your description whether the map phase reads from HBase.

> And finally, according to the logs I pasted, I see other lines with DEBUG
> or INFO. So I thought this was okay.
> Is there a way to change WARN level log to some other level log? If you'd
> let me know, I will paste another set of logs.

The connection reset stuff is interesting, and this warning indeed
suggests that something's weird. It would be interesting to see some
task logs (not the TaskTracker's, nor the JobTracker's; those are
usually of little use when debugging this type of problem). In any
case, what it means is that the client (the map or reduce task, or
even some other client you have) reset the connection, so the region
server just drops it.

> The regionserver that contains that specific region fails. That is the
> point. If I move that region to another regionserver using hbase shell,
> then that regionserver fails.
> With the same log output.

You haven't shown us the log output of a dying region server yet.
Actually, from those logs I don't even see a lot of importing going
on, just a lot of reading. Look for ERROR-level logging, then grab
everything around it and post it here please (go up in the log to the
point where it looks like normal logging; usually the ERROR gets
logged after some important lines).

It would also be interesting to see the full reducer task log.

J-D

On Thu, Dec 1, 2011 at 12:48 AM, edward choi <[email protected]> wrote:
> Hi,
> I've had a problem that has been killing for some days now.
> I am using CDH3 update2 version of Hadoop and Hbase.
> When I do a large amount of bulk loading into Hbase, some node always die.
> It's not just one particular node.
> But one of many nodes fail to serve eventually.
>
> I set 4 gigs of heap space for master, and regionservers. I monitored the
> process and when any node fails, it has not used all the heaps yet.
> So it is not a heap space problem.
>
> Below is what I get when I perform bulk put using MapReduce.
>
