Re: HBase fail-over/reliability issues

2010-05-10 Thread Todd Lipcon
Hi James, I'd recommend just the following in your log4j properties to tone down the log volume:

log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN

This will keep the INFO level logs that are very useful for debugging…
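For context, a minimal log4j.properties sketch showing where those two overrides would sit. Only the two `log4j.logger.*=WARN` lines come from the thread; the root logger and appender wiring here are illustrative:

```properties
# Keep the root at INFO so the generally useful logs survive
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

# Quiet only the two noisiest loggers (NN audit trail, DN client traces)
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN
```

The point of the targeted overrides is that the bulk of the volume comes from these two loggers, so the rest of the daemon logging can stay at INFO.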

Re: HBase fail-over/reliability issues

2010-05-10 Thread James Baldassari
Hi Todd, Our log files were getting to be several gigabytes in size at the INFO level (particularly the datanode logs), so we changed the log level in all log4j configs to WARN. Do you think we're potentially missing some useful information at INFO and lower? I could lower the log level if you…

Re: HBase fail-over/reliability issues

2010-05-08 Thread Todd Lipcon
Hi James, You'll need to go farther back in the logs to find what happened to the block that caused it to get deleted. All of the logs below are too late (the block's already gone; we need to figure out why). Can you look backwards through the past several days of the NN logs? Have you disabled t…

Re: HBase fail-over/reliability issues

2010-05-07 Thread James Baldassari
OK, these logs are huge, so I'm just going to post the first 1,000 lines from each for now. Let me know if it would be helpful to have more. The namenode logs didn't contain either of the strings you were interested in. A few of the datanode logs had '4841840178880951849': http://pastebin.com/4M…

Re: HBase fail-over/reliability issues

2010-05-07 Thread Todd Lipcon
If you can grep for '4841840178880951849' as well as /hbase/users/73382377/data/312780071564432169 across all of your datanode logs plus your NN, and put that online somewhere, that would be great. If you can grep with -C 20 to get some context, that would help as well. Grepping for the region in question…
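A sketch of the search Todd describes, using the block ID and HFile path from the thread. The log directory and file-name patterns are assumptions; adjust them to wherever your Hadoop install writes its logs:

```shell
# Block ID and HFile path quoted earlier in the thread
BLOCK_ID='4841840178880951849'
HFILE_PATH='/hbase/users/73382377/data/312780071564432169'

# Hypothetical log locations -- adjust to your install
: > /tmp/block-trace.txt
for log in /var/log/hadoop/*datanode*.log /var/log/hadoop/*namenode*.log; do
  [ -f "$log" ] || continue
  # -H prints the file name, -C 20 adds 20 lines of context per match
  grep -H -C 20 -e "$BLOCK_ID" -e "$HFILE_PATH" "$log" >> /tmp/block-trace.txt
done
```

Running this on every datanode plus the NN and concatenating the results gives one timeline of everything that touched the block, which is what makes the "why was it deleted" question answerable.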

Re: HBase fail-over/reliability issues

2010-05-07 Thread James Baldassari
Thanks, I'll check out HBASE-2231. Prior to this problem occurring, our cluster had been running for almost 2 weeks with no problems. I'm not sure about the GC pauses, but I'll look through the logs. I've never noticed that before, though. Also, maybe it would help to understand how we're using…

Re: HBase fail-over/reliability issues

2010-05-07 Thread James Baldassari
On Sat, May 8, 2010 at 12:02 AM, Stack wrote:
> On Fri, May 7, 2010 at 8:27 PM, James Baldassari wrote:
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
>
> This is the regionserver log? Is this deploying the region? It fails?

This error is…

Re: HBase fail-over/reliability issues

2010-05-07 Thread Todd Lipcon
This could very well be HBASE-2231. Do you find that region servers occasionally crash after going into GC pauses?

-Todd

On Fri, May 7, 2010 at 9:02 PM, Stack wrote:
> On Fri, May 7, 2010 at 8:27 PM, James Baldassari wrote:
> > java.io.IOException: Cannot open filename
> > /hbase/users/7338…
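One way to confirm the GC-pause symptom HBASE-2231 describes is to turn on GC logging for the region server JVM and look for long pauses that line up with the crashes. A hedged sketch for conf/hbase-env.sh; the flags are standard HotSpot options of this era, but the log path is an assumption:

```shell
# conf/hbase-env.sh -- enable GC logging for the HBase JVMs.
# Long full-GC / "concurrent mode failure" entries in the resulting log
# that coincide with region server crashes point at the HBASE-2231
# pattern (GC pause -> ZooKeeper session expiry -> server aborts).
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-regionserver.log"
```

With timestamps in the GC log, pauses can be matched against the region server log around each crash.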

Re: HBase fail-over/reliability issues

2010-05-07 Thread Stack
On Fri, May 7, 2010 at 8:27 PM, James Baldassari wrote:
> java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169

This is the regionserver log? Is this deploying the region? It fails?

> Our cluster throughput goes from around 3k requests/second down to 500-10…