@Ted, the log I pasted is at INFO level. I changed it to DEBUG level yesterday; if this happens again, I will upload the DEBUG-level log.
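
For reference, a sketch of such a change, assuming the stock conf/log4j.properties shipped with HBase 1.x (the exact edit made on our cluster may differ):

{code: title=conf/log4j.properties}
# Sketch: raise HRegion logging to DEBUG so messages like "Prepared flush with seqId" are captured
log4j.logger.org.apache.hadoop.hbase.regionserver.HRegion=DEBUG
{code}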
2016-05-30 21:57 GMT+08:00 Ted Yu <[email protected]>:

> There is a debug log in HRegion#replayWALFlushStartMarker:
>
>   LOG.debug(getRegionInfo().getEncodedName() + " : "
>       + " Prepared flush with seqId:" + flush.getFlushSequenceNumber());
>   ...
>   LOG.debug(getRegionInfo().getEncodedName() + " : "
>       + " Prepared empty flush with seqId:" + flush.getFlushSequenceNumber());
>
> I searched for them in the log you attached to HBASE-15900 but didn't find
> any occurrence.
>
> FYI
>
> On Mon, May 30, 2016 at 2:59 AM, Heng Chen <[email protected]> wrote:
>
> > I found something useful.
> >
> > When we do region.close, if there is a compaction or flush in progress,
> > close will wait for the compaction or flush to finish.
> >
> > {code: title=HRegion.java}
> > @Override
> > public void waitForFlushesAndCompactions() {
> >   synchronized (writestate) {
> >     if (this.writestate.readOnly) {
> >       // we should not wait for replayed flushed if we are read only (for example in case the
> >       // region is a secondary replica).
> >       return;
> >     }
> >     boolean interrupted = false;
> >     try {
> >       while (writestate.compacting > 0 || writestate.flushing) {
> >         LOG.debug("waiting for " + writestate.compacting + " compactions"
> >             + (writestate.flushing ? " & cache flush" : "") + " to complete for region " + this);
> >         try {
> >           writestate.wait();
> >         } catch (InterruptedException iex) {
> >           // essentially ignore and propagate the interrupt back up
> >           LOG.warn("Interrupted while waiting");
> >           interrupted = true;
> >         }
> >       }
> >     } finally {
> >       if (interrupted) {
> >         Thread.currentThread().interrupt();
> >       }
> >     }
> >   }
> > }
> > {code}
> >
> > And writestate.flushing will be set to true in two places:
> > HRegion.flushCache and HRegion.replayWALFlushStartMarker.
> >
> > {code: title=HRegion.flushCache}
> > synchronized (writestate) {
> >   if (!writestate.flushing && writestate.writesEnabled) {
> >     this.writestate.flushing = true;
> >   } else {
> >     ....
> >   }
> > }
> > {code}
> >
> > {code: title=HRegion.replayWALFlushStartMarker}
> > synchronized (writestate) {
> >   try {
> >     ....
> >     if (!writestate.flushing) {
> >       this.writestate.flushing = true;
> >       ...
> >     }
> > {code}
> >
> > Notice that, in HRegion.replayWALFlushStartMarker, we did not check
> > writestate.writesEnabled before setting writestate.flushing to true.
> >
> > So if region.close wakes up from writestate.wait but the lock is acquired
> > by HRegion.replayWALFlushStartMarker, flushing will be set to true again,
> > and region.close will be stuck in writestate.wait forever.
> >
> > Could this happen in the real logic?
> >
> > 2016-05-27 10:44 GMT+08:00 Heng Chen <[email protected]>:
> >
> > > Thanks guys, yesterday I restarted the related RS and the failed-close
> > > region reopened successfully. But today, another region has fallen into
> > > this state.
> > >
> > > I pasted the related RS's jstack information. This time the failed-close
> > > region is 9368190b3ba46238534b6307702aabae.
> > >
> > > 2016-05-26 21:50 GMT+08:00 Ted Yu <[email protected]>:
> > >
> > > > Heng:
> > > > Can you pastebin the complete stack trace for the region server?
> > > >
> > > > A snippet from the region server log may also provide more clues.
> > > >
> > > > Thanks
> > > >
> > > > On Wed, May 25, 2016 at 9:48 PM, Heng Chen <[email protected]> wrote:
> > > >
> > > > > On the master web UI, I can see region (c371fb20c372b8edbf54735409ab5c4a)
> > > > > always in failed close state, so the balancer could not run.
> > > > >
> > > > > I checked the region on the RS and found these logs about the region:
> > > > >
> > > > > 2016-05-26 12:42:10,490 INFO [MemStoreFlusher.1]
> > > > > regionserver.MemStoreFlusher: Waited 90447ms on a compaction to clean up
> > > > > 'too many store files'; waited long enough... proceeding with flush of
> > > > > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > > > > 2016-05-26 12:42:20,043 INFO
> > > > > [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
> > > > > regionserver.HRegionServer:
> > > > > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
> > > > > requesting flush for region
> > > > > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > > > > after a delay of 20753
> > > > > 2016-05-26 12:42:30,043 INFO
> > > > > [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
> > > > > regionserver.HRegionServer:
> > > > > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
> > > > > requesting flush for region
> > > > > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > > > > after a delay of 7057
> > > > >
> > > > > The related jstack information is below:
> > > > >
> > > > > Thread 12403 (RS_CLOSE_REGION-dx-pipe-regionserver4-online:16020-2):
> > > > >   State: WAITING
> > > > >   Blocked count: 1
> > > > >   Waited count: 2
> > > > >   Waiting on org.apache.hadoop.hbase.regionserver.HRegion$WriteState@1390594c
> > > > >   Stack:
> > > > >     java.lang.Object.wait(Native Method)
> > > > >     java.lang.Object.wait(Object.java:502)
> > > > >     org.apache.hadoop.hbase.regionserver.HRegion.waitForFlushesAndCompactions(HRegion.java:1512)
> > > > >     org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1371)
> > > > >     org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1336)
> > > > >     org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> > > > >     org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> > > > >     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > > > >     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > > > >     java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > > Our HBase cluster version is 1.1.1. I tried to compact this region, but
> > > > > the compaction is stuck at 89.58% progress:
> > > > >
> > > > > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > > > > 85860221   85860221   89.58%
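
To make the scenario described above concrete, here is a small standalone Java model of the hang (the class, thread names, and timings are made up for illustration, not HBase code): the closer disables writes and waits on the WriteState monitor; the in-progress flush finishes and notifies; but a replayed flush start marker re-arms flushing, without consulting writesEnabled, before the closer can re-check, so the closer sees flushing == true and waits forever.

{code: title=FlushingRaceSketch.java}
// Toy model only: not taken from HBase, just mirrors the shape of the code quoted above.
public class FlushingRaceSketch {

  static final class WriteState {
    boolean flushing = true;        // a flush is already in progress when close() starts
    boolean writesEnabled = true;
  }

  public static void main(String[] args) throws Exception {
    final WriteState writestate = new WriteState();

    // Models HRegion.doClose -> waitForFlushesAndCompactions: disable writes,
    // then wait on the WriteState monitor until flushing clears.
    Thread closer = new Thread(() -> {
      synchronized (writestate) {
        writestate.writesEnabled = false;
        while (writestate.flushing) {
          try {
            writestate.wait();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
        System.out.println("close completed");
      }
    }, "close-region");

    // Models the in-progress flush finishing, immediately followed by a replayed
    // flush start marker that re-arms flushing without checking writesEnabled.
    Thread flushThenReplay = new Thread(() -> {
      synchronized (writestate) {
        writestate.flushing = false;   // flush done, wake the closer
        writestate.notifyAll();
        if (!writestate.flushing) {    // replayWALFlushStartMarker-style check: no writesEnabled test
          writestate.flushing = true;  // re-armed before the closer can re-check
        }
      }
    }, "flush-finish-then-replay-marker");

    closer.start();
    Thread.sleep(200);                 // give the closer time to block in wait()
    flushThenReplay.start();
    flushThenReplay.join();

    closer.join(2000);                 // the closer never comes back...
    System.out.println("closer state: " + closer.getState());  // WAITING, i.e. stuck like the jstack above
    closer.interrupt();                // only so the demo exits; the real RS thread has no such escape
  }
}
{code}

In this model, applying the same guard that HRegion.flushCache already uses, !writestate.flushing && writestate.writesEnabled, on the replay path would keep flushing from being re-armed once close has disabled writes.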
