[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer

2022-04-29 Thread kangTwang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529821#comment-17529821
 ] 

kangTwang commented on HBASE-25768:
---

[~filtertip] 

I've tried this parameter in our environment before. No, it's just a temporary 
workaround.

> Support an overall coarse and fast balance strategy for StochasticLoadBalancer
> --
>
> Key: HBASE-25768
> URL: https://issues.apache.org/jira/browse/HBASE-25768
> Project: HBase
>  Issue Type: Improvement
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1, 2.0.0, 1.4.13
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> When we use StochasticLoadBalancer + balanceByTable, we can face two 
> difficulties.
>  # For each table, regions are distributed uniformly, but across the overall 
> cluster there can still be imbalance between RSes;
>  # When there is a large-scale restart of RSes, or an expansion of groups or 
> the cluster, we want the balancer to execute as soon as possible, but the 
> StochasticLoadBalancer may need a lot of time to compute costs.
> We can detect these circumstances in StochasticLoadBalancer (for example, 
> using the percentage of skewed tables), and before trying the normal balance 
> steps, add a strategy that balances like the SimpleLoadBalancer or uses a few 
> lightweight cost functions.
>  
>  
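As a rough illustration of the strategy described in the quoted issue, below is 
a minimal Java sketch (not HBase API) of a pre-check that estimates the fraction 
of skewed tables and, when a configurable threshold is exceeded, falls back to a 
coarse round-robin pass instead of the full stochastic search. All names here 
(CoarseBalanceHelper, isCoarseBalanceNeeded, coarseAssignments, the skew 
threshold) are hypothetical and only stand in for whatever the real patch 
introduces.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: decide whether a coarse, fast balance pass is
 * needed, and produce simple round-robin assignments if so. All names here
 * are hypothetical and not part of the HBase balancer API.
 */
public class CoarseBalanceHelper {

  /** A table is "skewed" if max and min region counts per server differ by more than 1. */
  static boolean isTableSkewed(Map<String, Integer> regionCountPerServer) {
    if (regionCountPerServer.isEmpty()) {
      return false;
    }
    int max = Integer.MIN_VALUE;
    int min = Integer.MAX_VALUE;
    for (int count : regionCountPerServer.values()) {
      max = Math.max(max, count);
      min = Math.min(min, count);
    }
    return max - min > 1;
  }

  /** Take the coarse path when the percentage of skewed tables crosses a threshold. */
  static boolean isCoarseBalanceNeeded(List<Map<String, Integer>> perTableCounts,
      double skewedTableRatioThreshold) {
    if (perTableCounts.isEmpty()) {
      return false;
    }
    long skewed = perTableCounts.stream()
        .filter(CoarseBalanceHelper::isTableSkewed)
        .count();
    return (double) skewed / perTableCounts.size() >= skewedTableRatioThreshold;
  }

  /** Coarse pass: spread regions round-robin over servers, ignoring fine-grained costs. */
  static Map<String, List<String>> coarseAssignments(List<String> regions,
      List<String> servers) {
    Map<String, List<String>> plan = new HashMap<>();
    for (String server : servers) {
      plan.put(server, new ArrayList<>());
    }
    for (int i = 0; i < regions.size(); i++) {
      plan.get(servers.get(i % servers.size())).add(regions.get(i));
    }
    return plan;
  }
}
{code}

In the real balancer, a check like this would run before the stochastic cost 
search, so that large-scale RS restarts or group expansions are leveled quickly 
and the expensive per-table computation only kicks in once the cluster is 
roughly balanced.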



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer

2022-04-29 Thread kangTwang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529818#comment-17529818
 ] 

kangTwang commented on HBASE-25768:
---

[~Xiaolin Ha] Will there be a patch for HBase 2.1.0 at present? The PR is for 
the 3.x version?

> Support an overall coarse and fast balance strategy for StochasticLoadBalancer
> --
>
> Key: HBASE-25768
> URL: https://issues.apache.org/jira/browse/HBASE-25768
> Project: HBase
>  Issue Type: Improvement
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1, 2.0.0, 1.4.13
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> When we use StochasticLoadBalancer + balanceByTable, we can face two 
> difficulties.
>  # For each table, regions are distributed uniformly, but across the overall 
> cluster there can still be imbalance between RSes;
>  # When there is a large-scale restart of RSes, or an expansion of groups or 
> the cluster, we want the balancer to execute as soon as possible, but the 
> StochasticLoadBalancer may need a lot of time to compute costs.
> We can detect these circumstances in StochasticLoadBalancer (for example, 
> using the percentage of skewed tables), and before trying the normal balance 
> steps, add a strategy that balances like the SimpleLoadBalancer or uses a few 
> lightweight cost functions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer

2022-04-24 Thread kangTwang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526990#comment-17526990
 ] 

kangTwang commented on HBASE-25768:
---

Hi [~Xiaolin Ha]:

Has this PR not been completed yet?

> Support an overall coarse and fast balance strategy for StochasticLoadBalancer
> --
>
> Key: HBASE-25768
> URL: https://issues.apache.org/jira/browse/HBASE-25768
> Project: HBase
>  Issue Type: Improvement
>  Components: Balancer
>Affects Versions: 3.0.0-alpha-1, 2.0.0, 1.4.13
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> When we use StochasticLoadBalancer + balanceByTable, we can face two 
> difficulties.
>  # For each table, regions are distributed uniformly, but across the overall 
> cluster there can still be imbalance between RSes;
>  # When there is a large-scale restart of RSes, or an expansion of groups or 
> the cluster, we want the balancer to execute as soon as possible, but the 
> StochasticLoadBalancer may need a lot of time to compute costs.
> We can detect these circumstances in StochasticLoadBalancer (for example, 
> using the percentage of skewed tables), and before trying the normal balance 
> steps, add a strategy that balances like the SimpleLoadBalancer or uses a few 
> lightweight cost functions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-20503) [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?

2021-08-25 Thread kangTwang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17404873#comment-17404873
 ] 

kangTwang commented on HBASE-20503:
---

Hi, is there a solution to this problem? I am also hitting it here, and it 
makes it impossible to write data.

> [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL 
> system stuck?
> -
>
> Key: HBASE-20503
> URL: https://issues.apache.org/jira/browse/HBASE-20503
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Reporter: Michael Stack
>Priority: Major
> Attachments: 
> 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch, 
> 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch
>
>
> Scale test. Startup with 30k regions over ~250 nodes. This RS is furiously 
> trying to open regions assigned by the Master, and importantly it is carrying 
> hbase:meta. Twenty minutes in, meta goes dead after an exception up out of 
> AsyncFSWAL. The process had been restarted, so I couldn't get a thread dump. 
> What is suspicious is that we archive a WAL and then get an FNFE because we 
> try to access the WAL in its old location. [~Apache9] mind taking a look? 
> Does this FNFE while rolling kill the WAL sub-system? Thanks.
> DFS is complaining on file open for a few files, getting blocks from remote 
> dead DNs, e.g. {{2018-04-25 10:05:21,506 WARN 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader.
> java.net.ConnectException: Connection refused}}
> AsyncFSWAL is complaining: "AbstractFSWAL: Slow sync cost: 103 ms".
> About ten minutes in, we get this:
> {code}
> 2018-04-25 10:15:16,532 WARN 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>   
>   
>   
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,680 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL 
> /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta
>  with entries=10819, filesize=7.57 MB; new WAL 
> /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:15:16,680 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving 
> hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
>  to 
> hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
> 2018-04-25 10:15:16,686 WARN 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to 
> write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>   at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>   at 
> 

[jira] [Commented] (HBASE-22657) HBase : STUCK Region-In-Transition

2021-06-16 Thread kangTwang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364655#comment-17364655
 ] 

kangTwang commented on HBASE-22657:
---

Hi:

    I also have this problem here. Have you solved it now?

> HBase : STUCK Region-In-Transition 
> ---
>
> Key: HBASE-22657
> URL: https://issues.apache.org/jira/browse/HBASE-22657
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: oktay tuncay
>Priority: Critical
>
> When we check the number of regions in transition on Ambari, it shows 1 
> transition waiting. (It's more than 1 in another cluster.)
> Also, when we check the table with the command "hbase hbck -details 
> *table_name*", the status looks INCONSISTENT:
> There are 0 overlap groups with 0 overlapping regions
> ERROR: Found inconsistency in table *Table_Name*
> Summary:
> Table hbase:meta is okay.
> Number of regions: 1
> Deployed on: hostname1:port, hostname2:port, hostname3:port, hostname4:port
> Table *Table_Name* is okay.
> Number of regions: 39
> Deployed on: hostname1:port, hostname2:port, hostname3:port, hostname4:port
> 2 inconsistencies detected.
> Status: *INCONSISTENT*
> When I checked the logfiles, I saw the following warning message:
> 2019-06-09T07:14:15.179+02:00 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition rit=CLOSING, location=*hostname*,*port*,1558699727048, 
> table=*table_name*, region=c67dd5d8bcd174cc2001695c31475ab1
> According to this message, region c67dd5d8bcd174cc2001695c31475ab1 is trying 
> to be assigned on *host*, but this operation is stuck.
> We stopped the RS process on *host* and force-assigned the region to another 
> RS that was running.
> *hbase(main):001:0> assign 'c67dd5d8bcd174cc2001695c31475ab1'*
> After that operation, the INCONSISTENT status was gone and we re-started the 
> RS on the host.
> One of the reasons a region gets stuck in transition is that, when it is 
> being moved across regionservers, it is unassigned from the source 
> regionserver but never assigned to another regionserver.
> I think the code below is responsible for that process:
> private void handleRegionOverStuckWarningThreshold(final RegionInfo regionInfo) {
>   final RegionStateNode regionNode = regionStates.getRegionStateNode(regionInfo);
>   //if (regionNode.isStuck()) {
>   LOG.warn("STUCK Region-In-Transition {}", regionNode);
> }
> It seems one potential way to unstick the region is to send a close request 
> to the region server. It may be blocked because another Procedure holds the 
> exclusive lock and is not letting go.
> My question is: what is the root cause of this problem? I think HBase should 
> be able to fix the region-in-transition issue itself.
> We can fix this problem manually, but some customers do not have this 
> knowledge, and I think HBase needs to recover by itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-25661) Unable rename error occurred in AbstractFSWAL, causing regionserver to crash

2021-03-12 Thread kangTwang (Jira)
kangTwang created HBASE-25661:
-

 Summary: Unable rename error occurred in AbstractFSWAL, causing 
regionserver to crash
 Key: HBASE-25661
 URL: https://issues.apache.org/jira/browse/HBASE-25661
 Project: HBase
  Issue Type: Bug
  Components: API
Affects Versions: 2.1.0
Reporter: kangTwang
 Fix For: 2.1.0


The error is as follows:

{code}
[ERROR] - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:631) - Cache flush failed for region test_2,0293601280,1614762174258.030beae347d51a5fb6782f6cb025f763.

[ERROR] - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:631) - Cache flush failed for region test_2,0293601280,1614762174258.030beae347d51a5fb6782f6cb025f763.
java.io.IOException: WAL has been closed
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:698) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:817) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.LogRoller.abort(LogRoller.java:143) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:201) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at java.lang.Thread.run(Thread.java:834) ~[?:?]
[17:11:59:664] [INFO] - org.apache.hadoop.hbase.regionserver.HRegion.logFatLineOnFlush(HRegion.java:2636) - Flushing 1/1 column families, dataSize=127.76 MB heapSize=136.99 MB
[17:11:59:665] [WARN] - org.apache.hadoop.hbase.regionserver.HRegion.doAbortFlushToWAL(HRegion.java:2652) - Received unexpected exception trying to write ABORT_FLUSH marker to WAL:
java.io.IOException: Cannot append; log is closed, regionName = test_2,0377487360,1614762174258.146bbdf3caa203124cd039e48dd3e344.
  at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(AbstractFSWAL.java:962)
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(AsyncFSWAL.java:563)
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:156)
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:85)
  at org.apache.hadoop.hbase.regionserver.HRegion.doAbortFlushToWAL(HRegion.java:2649)
  at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2599)
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2465)
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439)
  at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:612)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:581)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68)
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:361)
  at java.base/java.lang.Thread.run(Thread.java:834)
[17:11:59:665] [ERROR] - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:631) - Cache flush failed for region test_2,0377487360,1614762174258.146bbdf3caa203124cd039e48dd3e344.
java.io.IOException: Cannot append; log is closed, regionName = test_2,0377487360,1614762174258.146bbdf3caa203124cd039e48dd3e344.
  at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(AbstractFSWAL.java:962) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(AsyncFSWAL.java:563) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:156) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:85) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2588) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2465) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:612) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:581) ~[hbase-server-2.1.0-cdh6.3.0.jar:?]
  at
{code}