[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer
[ https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529821#comment-17529821 ]

kangTwang commented on HBASE-25768:
-----------------------------------

[~filtertip] I have tried this parameter in my environment before. No, it is just a temporary workaround.

> Support an overall coarse and fast balance strategy for StochasticLoadBalancer
> -------------------------------------------------------------------------------
>
> Key: HBASE-25768
> URL: https://issues.apache.org/jira/browse/HBASE-25768
> Project: HBase
> Issue Type: Improvement
> Components: Balancer
> Affects Versions: 3.0.0-alpha-1, 2.0.0, 1.4.13
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
>
> When we use StochasticLoadBalancer + balanceByTable, we can face two difficulties.
> # For each table, the regions are distributed uniformly, but across the overall cluster there can still be imbalance between RSes;
> # When there is a large-scale restart of RSes, or an expansion of groups or of the cluster, we want the balancer to run as soon as possible, but the StochasticLoadBalancer may need a lot of time to compute costs.
> We can detect these circumstances in the StochasticLoadBalancer (for example, using the percentage of skewed tables) and, before trying the normal balance steps, add a strategy that balances coarsely like the SimpleLoadBalancer or uses only a few lightweight cost functions.
>
>

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
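To make the proposed strategy concrete, here is a minimal sketch of the idea in the description, with hypothetical class, method, and threshold names (this is not the actual StochasticLoadBalancer code, nor the patch on the PR): before running the expensive stochastic cost search, check whether the fraction of skewed tables crosses a threshold, and if it does, fall back to a simple round-robin redistribution in the spirit of the SimpleLoadBalancer.

{code:java}
// Hypothetical sketch of a coarse pre-balance pass; class, field and threshold
// names are illustrative, not the real StochasticLoadBalancer API.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CoarseBalanceSketch {

  // Fraction of tables that must look skewed before the coarse pass kicks in (made-up value).
  private static final double SKEW_TABLE_RATIO_THRESHOLD = 0.5;

  // A table is "skewed" if its max and min region counts per server differ by more than one.
  static boolean isSkewed(Map<String, Integer> regionsPerServer) {
    if (regionsPerServer.isEmpty()) {
      return false;
    }
    int max = Collections.max(regionsPerServer.values());
    int min = Collections.min(regionsPerServer.values());
    return max - min > 1;
  }

  // Decide whether to skip the expensive stochastic search for this balancing round.
  static boolean shouldUseCoarseBalance(List<Map<String, Integer>> perTableDistribution) {
    if (perTableDistribution.isEmpty()) {
      return false;
    }
    long skewed = perTableDistribution.stream().filter(CoarseBalanceSketch::isSkewed).count();
    return (double) skewed / perTableDistribution.size() >= SKEW_TABLE_RATIO_THRESHOLD;
  }

  // Coarse pass: spread all regions round-robin across servers, like SimpleLoadBalancer would.
  static Map<String, List<String>> roundRobinPlan(List<String> regions, List<String> servers) {
    Map<String, List<String>> plan = new HashMap<>();
    servers.forEach(s -> plan.put(s, new ArrayList<>()));
    for (int i = 0; i < regions.size(); i++) {
      plan.get(servers.get(i % servers.size())).add(regions.get(i));
    }
    return plan;
  }
}
{code}

In a real balancer these checks would have to cooperate with balanceByTable and the existing cost functions; the sketch only illustrates the "detect skew, then do a cheap coarse pass first" control flow.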
[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer
[ https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529818#comment-17529818 ]

kangTwang commented on HBASE-25768:
-----------------------------------

[~Xiaolin Ha] Will there be a patch for HBase 2.1.0 at this point? Is the PR against the 3.x branch?

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer
[ https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526990#comment-17526990 ]

kangTwang commented on HBASE-25768:
-----------------------------------

Hi [~Xiaolin Ha], has this PR been completed yet?

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Commented] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
[ https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17404873#comment-17404873 ]

kangTwang commented on HBASE-20503:
-----------------------------------

Hi, is there a solution to this problem? I am also hitting it here, and it makes it impossible to write data.

> [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
> ------------------------------------------------------------------------------------------
>
> Key: HBASE-20503
> URL: https://issues.apache.org/jira/browse/HBASE-20503
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: Michael Stack
> Priority: Major
> Attachments: 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch, 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch
>
> Scale test. Startup w/ 30k regions over ~250 nodes. This RS is trying to furiously open regions assigned by the Master. Importantly, it is carrying hbase:meta. Twenty minutes in, meta goes dead after an exception up out of AsyncFSWAL. The process had been restarted so I couldn't get a thread dump. Suspicious is that we archive a WAL and then get a FNFE because we try to access the WAL at its old location. [~Apache9] mind taking a look? Does this FNFE on rolling kill the WAL sub-system? Thanks.
> DFS complaining on file open for a few files getting blocks from remote dead DNs: e.g. {{2018-04-25 10:05:21,506 WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing remote block reader. java.net.ConnectException: Connection refused}}
> AsyncFSWAL complaining: "AbstractFSWAL: Slow sync cost: 103 ms".
> About ten minutes in, we get this:
> {code}
> 2018-04-25 10:15:16,532 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta with entries=10819, filesize=7.57 MB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
> 2018-04-25 10:15:16,686 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
>   at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
>   at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>   at
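As a reading aid for the traces above, here is a minimal, hypothetical model (not the real FanOutOneBlockAsyncDFSOutput or AsyncFSWAL code) of the fail-fast pattern they show: once the underlying output stream has been marked broken by an earlier error, every later flush/sync attempt, including the trailer write during the roll, fails immediately with the same "stream already broken" IOException.

{code:java}
// Illustrative only: a simplified model of the fail-fast behaviour seen in the
// stack traces above, not the real FanOutOneBlockAsyncDFSOutput implementation.
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

class BrokenStreamModel {
  private volatile boolean broken = false;

  // Mark the stream broken, e.g. after a datanode connection error.
  void markBroken() {
    broken = true;
  }

  // Every flush after the stream breaks fails immediately with the same error.
  CompletableFuture<Long> flush() {
    CompletableFuture<Long> future = new CompletableFuture<>();
    if (broken) {
      future.completeExceptionally(new IOException("stream already broken"));
      return future;
    }
    // ... the happy path would push buffered data to the datanodes here ...
    future.complete(0L);
    return future;
  }

  public static void main(String[] args) {
    BrokenStreamModel out = new BrokenStreamModel();
    out.markBroken(); // simulate the earlier connection failure
    out.flush().whenComplete((len, err) ->
        System.out.println("sync failed: " + err)); // mirrors the "sync failed" WARN above
  }
}
{code}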
[jira] [Commented] (HBASE-22657) HBase : STUCK Region-In-Transition
[ https://issues.apache.org/jira/browse/HBASE-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364655#comment-17364655 ]

kangTwang commented on HBASE-22657:
-----------------------------------

Hi, I am also hitting this problem here. Have you solved it by now?

> HBase : STUCK Region-In-Transition
> ----------------------------------
>
> Key: HBASE-22657
> URL: https://issues.apache.org/jira/browse/HBASE-22657
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: oktay tuncay
> Priority: Critical
>
> When we check the number of regions in transition on Ambari, it shows 1 transition waiting. (It is more than 1 on another cluster.)
> Also, when we check the table with the command "hbase hbck -details *table_name*", the status is INCONSISTENT:
> There are 0 overlap groups with 0 overlapping regions
> ERROR: Found inconsistency in table *Table_Name*
> Summary:
> Table hbase:meta is okay.
> Number of regions: 1
> Deployed on: hostname1:port, hostname2:port, hostname3:port, hostname4:port
> Table *Table_Name* is okay.
> Number of regions: 39
> Deployed on: hostname1:port, hostname2:port, hostname3:port, hostname4:port
> 2 inconsistencies detected.
> Status: *INCONSISTENT*
> When I checked the log files, I saw the following warning message:
> 2019-06-09T07:14:15.179+02:00 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=CLOSING, location=*hostname*,*port*,1558699727048, table=*table_name*, region=c67dd5d8bcd174cc2001695c31475ab1
> According to this message, region c67dd5d8bcd174cc2001695c31475ab1 is being assigned to *host*, but the operation is stuck.
> We stopped the RS process on *host* and force-assigned the region to another RS that was running:
> *hbase(main):001:0> assign 'c67dd5d8bcd174cc2001695c31475ab1'*
> After that operation, the INCONSISTENT status was gone and we restarted the RS on the host.
> One of the reasons why a region gets stuck in transition is that, when it is being moved across regionservers, it is unassigned from the source regionserver but is never assigned to another regionserver.
> I think the code below is responsible for that warning:
> private void handleRegionOverStuckWarningThreshold(final RegionInfo regionInfo) {
>   final RegionStateNode regionNode = regionStates.getRegionStateNode(regionInfo);
>   //if (regionNode.isStuck()) {
>   LOG.warn("STUCK Region-In-Transition {}", regionNode);
> It seems one potential way of unsticking the region is to send a close request to the region server. It may be blocked because another Procedure holds the exclusive lock and is not letting go.
> My question is: what is the root cause of this problem? I think HBase should be able to fix the Region-In-Transition issue itself.
> We can fix this problem manually, but some customers do not have this knowledge, and I think HBase needs to recover by itself.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
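For what it is worth, here is a rough sketch of the kind of self-healing the reporter asks for at the end of the description, under the assumption that the master could simply re-queue an assignment for a region that has been in transition too long. All names here are hypothetical; this is not AssignmentManager code.

{code:java}
// Hypothetical sketch of the auto-recovery asked for above; the class, the
// threshold, and the reassign hook are illustrative, NOT the AssignmentManager API.
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Consumer;

public class StuckRitRecoverySketch {

  // How long a region may stay in transition before we consider it stuck (made-up value).
  static final Duration STUCK_THRESHOLD = Duration.ofMinutes(5);

  static class RegionInTransition {
    final String encodedName;
    final String state;        // e.g. "CLOSING" or "OPENING"
    final Instant sinceWhen;   // when the region entered this state

    RegionInTransition(String encodedName, String state, Instant sinceWhen) {
      this.encodedName = encodedName;
      this.state = state;
      this.sinceWhen = sinceWhen;
    }
  }

  // Scan regions in transition and re-queue any that have been stuck longer than the
  // threshold, instead of only logging a "STUCK Region-In-Transition" warning.
  static void recoverStuckRegions(List<RegionInTransition> rits, Consumer<String> reassign) {
    Instant now = Instant.now();
    for (RegionInTransition rit : rits) {
      if (Duration.between(rit.sinceWhen, now).compareTo(STUCK_THRESHOLD) > 0) {
        System.out.printf("STUCK Region-In-Transition rit=%s, state=%s; re-queueing assignment%n",
            rit.encodedName, rit.state);
        reassign.accept(rit.encodedName); // hand the region back to an assignment queue
      }
    }
  }
}
{code}

Whether such a re-queue is safe in practice depends on why the region is stuck; as noted above, a Procedure holding the exclusive lock would still have to be dealt with first.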
[jira] [Created] (HBASE-25661) Unable rename error occurred in AbstractFSWAL, causing regionserver to crash
kangTwang created HBASE-25661:
---------------------------------

Summary: Unable rename error occurred in AbstractFSWAL, causing regionserver to crash
Key: HBASE-25661
URL: https://issues.apache.org/jira/browse/HBASE-25661
Project: HBase
Issue Type: Bug
Components: API
Affects Versions: 2.1.0
Reporter: kangTwang
Fix For: 2.1.0

The error is as follows:

[ERROR] - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:631) - Cache flush failed for region test_2,0293601280,1614762174258.030beae347d51a5fb6782f6cb025f763.
[ERROR] - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:631) - Cache flush failed for region test_2,0293601280,1614762174258.030beae347d51a5fb6782f6cb025f763.
java.io.IOException: WAL has been closed
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:698) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:817) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.LogRoller.abort(LogRoller.java:143) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:201) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at java.lang.Thread.run(Thread.java:834) ~[?:?]
[17:11:59:664] [INFO] - org.apache.hadoop.hbase.regionserver.HRegion.logFatLineOnFlush(HRegion.java:2636) - Flushing 1/1 column families, dataSize=127.76 MB heapSize=136.99 MB
[17:11:59:665] [WARN] - org.apache.hadoop.hbase.regionserver.HRegion.doAbortFlushToWAL(HRegion.java:2652) - Received unexpected exception trying to write ABORT_FLUSH marker to WAL:
java.io.IOException: Cannot append; log is closed, regionName = test_2,0377487360,1614762174258.146bbdf3caa203124cd039e48dd3e344.
  at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(AbstractFSWAL.java:962)
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(AsyncFSWAL.java:563)
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:156)
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:85)
  at org.apache.hadoop.hbase.regionserver.HRegion.doAbortFlushToWAL(HRegion.java:2649)
  at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2599)
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2465)
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439)
  at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:612)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:581)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:361)
  at java.base/java.lang.Thread.run(Thread.java:834)
[17:11:59:665] [ERROR] - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:631) - Cache flush failed for region test_2,0377487360,1614762174258.146bbdf3caa203124cd039e48dd3e344.
java.io.IOException: Cannot append; log is closed, regionName = test_2,0377487360,1614762174258.146bbdf3caa203124cd039e48dd3e344.
  at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(AbstractFSWAL.java:962) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(AsyncFSWAL.java:563) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:156) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:85) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2588) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2465) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:612) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:581) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at