The RS log is huge. What do you want to see, other than what was posted earlier?
On Sep 11, 2014, at 6:10 AM, Qiang Tian <tian...@gmail.com> wrote:

> that is strange. we hit it in 0.94 as well:
> https://issues.apache.org/jira/browse/HBASE-11695
> what Nicolas raised makes sense to me, but revisiting the code, the flush
> every 10s in the RS log actually comes from HRegion#shouldFlush, so
> something triggered it...
>
> could you pastebin the RS log?
>
> On Wed, Sep 10, 2014 at 6:59 PM, Brian Jeltema
> <brian.jelt...@digitalenvoy.net> wrote:
>
>>> out of curiosity, did you see the messages below in the RS log?
>>>
>>>     LOG.warn("Snapshot called again without clearing previous. " +
>>>         "Doing nothing. Another ongoing flush or did we fail last attempt?");
>>
>> Nope
>>
>>> thanks.
>>>
>>> On Tue, Sep 9, 2014 at 2:15 AM, Brian Jeltema
>>> <brian.jelt...@digitalenvoy.net> wrote:
>>>
>>>> I’ve resolved these problems by restarting the region server that
>>>> owned the region in question. I don’t know what the underlying issue
>>>> was, but at this point it’s not worth pursuing.
>>>>
>>>> Thanks for responding.
>>>>
>>>> Brian
>>>>
>>>> On Sep 8, 2014, at 11:06 AM, Brian Jeltema
>>>> <brian.jelt...@digitalenvoy.net> wrote:
>>>>
>>>>> I realized today that the region server log for the region being
>>>>> updated (startKey=\x00DDD@) contains the following:
>>>>>
>>>>> 2014-09-08 06:25:50,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 11302
>>>>> 2014-09-08 06:26:00,222 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 21682
>>>>> 2014-09-08 06:26:10,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 5724
>>>>> 2014-09-08 06:26:20,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 11962
>>>>> 2014-09-08 06:26:30,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 7693
>>>>> 2014-09-08 06:26:40,224 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 12420
>>>>> 2014-09-08 06:26:50,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 5578
>>>>>
>>>>> A log entry has been generated every 10 seconds, starting about 4
>>>>> days ago. I presume these problems are related.
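To make the flush loop above concrete: the periodic-flush decision Qiang points at boils down to logic along these lines. This is a simplified, self-contained sketch approximating the 0.98-era HRegion#shouldFlush, not the verbatim HBase source; the class name, parameter names, and standalone main are illustrative:

    // Sketch of the periodic-flush check; approximates the shape of the
    // 0.98-era HRegion#shouldFlush. Names here are illustrative
    // assumptions, not the actual HBase source.
    public class PeriodicFlushCheck {

        /**
         * @param now                current time, ms
         * @param flushCheckInterval hbase.regionserver.optionalcacheflushinterval
         *                           (default 3600000 ms); <= 0 disables the check
         * @param lastFlushTime      when the region last flushed, ms
         * @param oldestEditTimes    per-store timestamp of the oldest unflushed edit
         */
        static boolean shouldFlush(long now, long flushCheckInterval,
                                   long lastFlushTime, long[] oldestEditTimes) {
            if (flushCheckInterval <= 0) {
                return false;                     // periodic flushing disabled
            }
            if (now - lastFlushTime < flushCheckInterval) {
                return false;                     // flushed recently enough
            }
            // Request a flush if any store holds an edit older than the interval.
            for (long oldestEdit : oldestEditTimes) {
                if (oldestEdit < now - flushCheckInterval) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            long hour = 3_600_000L;
            // A two-hour-old unflushed edit keeps triggering the request;
            // prints "true".
            System.out.println(shouldFlush(now, hour, now - 2 * hour,
                                           new long[] { now - 2 * hour }));
        }
    }

If the requested flush never actually completes, each pass of the 10-second flusher chore sees the same old edit and requests the flush again, which would explain the log line repeating every 10 seconds.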
>>>>>
>>>>> On Sep 8, 2014, at 7:10 AM, Brian Jeltema
>>>>> <brian.jelt...@digitalenvoy.net> wrote:
>>>>>
>>>>>>> When the number of attempts is greater than the value of
>>>>>>> hbase.client.start.log.errors.counter (default 9), AsyncProcess
>>>>>>> produces logs like those cited below.
>>>>>>> The interval following 'retrying after' is the backoff time.
>>>>>>>
>>>>>>> Which release of HBase are you using?
>>>>>>
>>>>>> HBase Version 0.98.0.2.1.1.0-385-hadoop2
>>>>>>
>>>>>> The MR job is reading from an HBase snapshot, if that’s relevant.
>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Sun, Sep 7, 2014 at 8:50 AM, Brian Jeltema
>>>>>>> <brian.jelt...@digitalenvoy.net> wrote:
>>>>>>>
>>>>>>>> I have a map/reduce job that is consistently failing with timeouts.
>>>>>>>> The failing mapper log files contain a series of records similar
>>>>>>>> to those below. When I look at the HBase and HDFS logs (on foo.net
>>>>>>>> in this case) I don’t see anything obvious at these timestamps.
>>>>>>>> The mapper task times out at/near attempt=25/35. Can anyone shed
>>>>>>>> light on what these log entries mean?
>>>>>>>>
>>>>>>>> Thanks - Brian
>>>>>>>>
>>>>>>>> 2014-09-07 09:36:51,421 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=10/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10029 ms, replay 1062 ops
>>>>>>>> 2014-09-07 09:37:01,642 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=11/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10023 ms, replay 1062 ops
>>>>>>>> 2014-09-07 09:37:12,064 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=12/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20182 ms, replay 1062 ops
>>>>>>>> 2014-09-07 09:37:32,708 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=13/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20140 ms, replay 1062 ops
>>>>>>>> 2014-09-07 09:37:52,940 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=14/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
>>>>>>>> 2014-09-07 09:38:13,324 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=15/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
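On the retry lines themselves: the attempt=NN/35 ceiling and the backoff intervals come from client-side configuration. A minimal sketch of the relevant knobs (the property keys are standard HBase client settings; the values shown are, as far as I know, the 0.98 defaults, not tuning advice):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientRetryConfig {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Total attempts per operation; this is the "/35" in attempt=10/35.
            conf.setInt("hbase.client.retries.number", 35);
            // Base pause (ms) used to compute the exponential backoff;
            // "retrying after NNNN ms" grows from this.
            conf.setLong("hbase.client.pause", 100);
            // Attempts beyond this count get logged, which is why the
            // AsyncProcess lines start appearing around attempt=10.
            conf.setInt("hbase.client.start.log.errors.counter", 9);
        }
    }

With the default 100 ms pause and the client's backoff multiplier table (HConstants.RETRY_BACKOFF, which plateaus at 100 and then 200), attempts around 10-11 back off roughly 10 s and attempts 12 and later roughly 20 s, which appears to line up with the 10029/20182 ms intervals in the log above.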