[jira] [Resolved] (HBASE-28016) hbck2 should support change region state of meta
[ https://issues.apache.org/jira/browse/HBASE-28016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang resolved HBASE-28016.
Resolution: Won't Fix

> hbck2 should support change region state of meta
>
> Key: HBASE-28016
> URL: https://issues.apache.org/jira/browse/HBASE-28016
> Project: HBase
> Issue Type: Improvement
> Components: hbase-operator-tools, hbck2
> Affects Versions: hbase-operator-tools-1.2.0
> Reporter: Zheng Wang
> Assignee: Zheng Wang
> Priority: Major
>
> The region state of meta is stored in ZooKeeper; if the state is wrong, we need a way to change it.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HBASE-28016) hbck2 should support change region state of meta
[ https://issues.apache.org/jira/browse/HBASE-28016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766354#comment-17766354 ]
Zheng Wang commented on HBASE-28016:
Changed to another way of doing it.
[jira] [Commented] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755311#comment-17755311 ]
Zheng Wang commented on HBASE-26987:
I updated this PR; could you help to review it? Thanks. [~zhangduo]

> The length of compact queue grows too big when the compacting is slow
>
> Key: HBASE-26987
> URL: https://issues.apache.org/jira/browse/HBASE-26987
> Project: HBase
> Issue Type: Improvement
> Reporter: Zheng Wang
> Assignee: Zheng Wang
> Priority: Major
> Attachments: image-2022-04-29-10-26-09-351.png, image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png
>
> For some system compactions we set selectNow to false, so file selection is not done until the compaction runs. This has a side effect: if another compaction is slow, we may put lots of compactions into the queue, because the filesCompacting of HStore is empty in the meantime.
> An example is shown in the attachments: there are 154 regions and about 2000 hfiles, but the length of the compact queue grows to 1391, which causes confusion and may trigger unexpected alarms.
> My approach is to limit the compaction queue count, computed from filesNotCompacting and hbase.hstore.compaction.max.
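The capping approach described in this issue can be sketched as follows. This is illustrative Java only, not the HBase implementation or the actual patch; the class and method names are assumptions, and only the idea (bound queued compactions by what the not-yet-compacting files could actually produce) comes from the issue.

```java
// Sketch of the queue-capping idea from HBASE-26987: bound the number of
// queued (selectNow=false) compaction requests for a store by how many
// compactions its not-yet-compacting files could actually yield.
// computeQueueCap and its parameters are illustrative names, not HBase API.
public class CompactionQueueCap {
    /**
     * @param filesNotCompacting number of store files not currently compacting
     * @param compactionMax      value of hbase.hstore.compaction.max (max files per compaction)
     * @return the largest number of compactions that could still be scheduled
     */
    static int computeQueueCap(int filesNotCompacting, int compactionMax) {
        if (compactionMax <= 0) {
            throw new IllegalArgumentException("compactionMax must be positive");
        }
        // Ceiling division: e.g. 2000 files at 10 per compaction -> at most 200 tasks,
        // rather than the 1391 queued requests seen in the attachments.
        return (filesNotCompacting + compactionMax - 1) / compactionMax;
    }

    public static void main(String[] args) {
        System.out.println(computeQueueCap(2000, 10));
    }
}
```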
[jira] [Created] (HBASE-28016) hbck2 should support change region state of meta
Zheng Wang created HBASE-28016:
Summary: hbck2 should support change region state of meta
Key: HBASE-28016
URL: https://issues.apache.org/jira/browse/HBASE-28016
Project: HBase
Issue Type: Improvement
Components: hbase-operator-tools, hbck2
Affects Versions: hbase-operator-tools-1.2.0
Reporter: Zheng Wang
Assignee: Zheng Wang

The region state of meta is stored in ZooKeeper; if the state is wrong, we need a way to change it.
[jira] [Comment Edited] (HBASE-27805) The chunk created by mslab may cause memory fragment and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747710#comment-17747710 ]
Zheng Wang edited comment on HBASE-27805 at 7/27/23 2:07 AM:
In this issue, we just updated the documentation and provided a workaround.

> The chunk created by mslab may cause memory fragment and lead to fullgc
>
> Key: HBASE-27805
> URL: https://issues.apache.org/jira/browse/HBASE-27805
> Project: HBase
> Issue Type: Improvement
> Components: documentation
> Reporter: Zheng Wang
> Assignee: Zheng Wang
> Priority: Major
> Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png
>
> The default chunk size is 2 MB. When we use G1 and heapRegionSize equals 4 MB, these chunks are allocated as humongous objects, each exclusively occupying one region, so the remaining 2 MB becomes a memory fragment.
> Lots of memory fragments may lead to a full GC even if the percentage of used heap is not high.
> I tested reducing the chunk size to 2047 KB (2 MB minus 1 KB, slightly less than half of heapRegionSize), and the above did not recur.
> BTW, in G1, humongous objects are objects larger than or equal to half a region, and heapRegionSize is automatically calculated from the heap size parameter if not explicitly specified.
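The arithmetic behind the report above can be checked directly. This is a plain-Java illustration of G1's documented humongous-object rule (an allocation of at least half a heap-region size is humongous), not HBase or JVM code; the class and method names are made up for the example.

```java
// Illustrates the sizing argument from HBASE-27805: with a 4 MB G1 heap region,
// the default 2 MB MSLAB chunk meets the humongous threshold (>= region/2) and
// occupies a whole region, stranding the other 2 MB; a 2047 KB chunk does not.
public class HumongousCheck {
    /** G1 treats an allocation of at least half a heap region as humongous. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = 4L * 1024 * 1024;                         // heapRegionSize = 4 MB
        System.out.println(isHumongous(2048L * 1024, region));  // default 2 MB chunk
        System.out.println(isHumongous(2047L * 1024, region));  // reduced 2047 KB chunk
    }
}
```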
[jira] [Commented] (HBASE-27805) The chunk created by mslab may cause memory fragment and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747710#comment-17747710 ]
Zheng Wang commented on HBASE-27805:
In this issue, we just updated the documentation and provided a workaround.
[jira] [Resolved] (HBASE-27805) The chunk created by mslab may cause memory fragment and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang resolved HBASE-27805.
Resolution: Fixed
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragment and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang updated HBASE-27805:
Component/s: documentation (was: regionserver)
[jira] [Commented] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743788#comment-17743788 ]
Zheng Wang commented on HBASE-26987:
Yeah, that's what the problem is. One thing to consider about the solution: if the number of files to be compacted is very large but there is only one task in the queue, the queue length will not reflect the actual situation. [~zhangduo]
[jira] [Commented] (HBASE-27964) Adds a switch for compaction's delay selection feature
[ https://issues.apache.org/jira/browse/HBASE-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740578#comment-17740578 ]
Zheng Wang commented on HBASE-27964:
Here is a test. I wrote data to a cluster with two regionservers and limited the compaction throughput. node9 (10.0.0.9) applies the patch and disables delayed selection. We can see that store_file_count and store_file_size grow similarly, but compaction_queue_length differs greatly; node23 (10.0.0.23) is obviously incorrect.
!image-2023-07-06-19-51-21-354.png|width=503,height=192!
!image-2023-07-06-20-00-21-587.png|width=502,height=190!
!image-2023-07-06-20-00-41-933.png|width=501,height=192!

> Adds a switch for compaction's delay selection feature
>
> Key: HBASE-27964
> URL: https://issues.apache.org/jira/browse/HBASE-27964
> Project: HBase
> Issue Type: Improvement
> Components: Compaction
> Reporter: Zheng Wang
> Assignee: Zheng Wang
> Priority: Major
> Attachments: image-2023-07-06-19-51-21-354.png, image-2023-07-06-20-00-21-587.png, image-2023-07-06-20-00-41-933.png
>
> When compaction pressure is high, delayed selection can cause the compact queue length metric to continuously and incorrectly increase. We should provide an option to disable this feature if the user values metric accuracy more.
> See HBASE-26987 for more detail.
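The kind of switch this issue describes can be sketched as follows. This is an illustration only, not the actual HBASE-27964 patch: the configuration key and the `selectNow` helper are hypothetical names, and only the behaviour (when the switch disables delayed selection, files are selected at enqueue time so the queue metric stays accurate) reflects the issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a switch for compaction's delayed-selection feature:
// when the (made-up) flag is off, selection is forced at enqueue time.
public class DelayedSelectionSwitch {
    // Illustrative configuration key, not a real HBase property.
    static final String KEY = "hbase.compaction.delay.selection.enabled";

    /** Decide whether to select files now, given the request and the switch. */
    static boolean selectNow(Map<String, String> conf, boolean requestedSelectNow) {
        boolean delayEnabled = Boolean.parseBoolean(conf.getOrDefault(KEY, "true"));
        // If delayed selection is disabled, always select files up front.
        return requestedSelectNow || !delayEnabled;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, "false");
        // Switch off: even a selectNow=false request selects immediately.
        System.out.println(selectNow(conf, false));
    }
}
```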
[jira] [Updated] (HBASE-27964) Adds a switch for compaction's delay selection feature
[ https://issues.apache.org/jira/browse/HBASE-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang updated HBASE-27964:
Attachment: image-2023-07-06-20-00-21-587.png
[jira] [Updated] (HBASE-27964) Adds a switch for compaction's delay selection feature
[ https://issues.apache.org/jira/browse/HBASE-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang updated HBASE-27964:
Attachment: image-2023-07-06-20-00-41-933.png
[jira] [Updated] (HBASE-27964) Adds a switch for compaction's delay selection feature
[ https://issues.apache.org/jira/browse/HBASE-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang updated HBASE-27964:
Attachment: image-2023-07-06-19-51-21-354.png
[jira] [Comment Edited] (HBASE-27964) Adds a switch for compaction's delay selection feature
[ https://issues.apache.org/jira/browse/HBASE-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740515#comment-17740515 ]
Zheng Wang edited comment on HBASE-27964 at 7/6/23 9:27 AM:
Yeah, I think so. Since this patch does not fix it but just adds a switch, I set the type as Improvement. [~zhangduo]
[jira] [Commented] (HBASE-27964) Adds a switch for compaction's delay selection feature
[ https://issues.apache.org/jira/browse/HBASE-27964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740515#comment-17740515 ]
Zheng Wang commented on HBASE-27964:
Yeah, I think so. [~zhangduo]
[jira] [Created] (HBASE-27964) Adds a switch for compaction's delay selection feature
Zheng Wang created HBASE-27964:
Summary: Adds a switch for compaction's delay selection feature
Key: HBASE-27964
URL: https://issues.apache.org/jira/browse/HBASE-27964
Project: HBase
Issue Type: Improvement
Components: Compaction
Reporter: Zheng Wang
Assignee: Zheng Wang

When compaction pressure is high, delayed selection can cause the compact queue length metric to continuously and incorrectly increase. We should provide an option to disable this feature if the user values metric accuracy more. See HBASE-26987 for more detail.
[jira] [Commented] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721157#comment-17721157 ]
Zheng Wang commented on HBASE-27788:
Pushed to master and branch-2.

> Skip family comparing when compare cells inner the store
>
> Key: HBASE-27788
> URL: https://issues.apache.org/jira/browse/HBASE-27788
> Project: HBase
> Issue Type: Improvement
> Components: Performance
> Reporter: Zheng Wang
> Assignee: Zheng Wang
> Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
> Attachments: BenchmarkForInnerStore.java, BenchmarkForNormal.java
>
> Currently we use CellComparatorImpl to compare cells; it compares the row first, then the family, then the qualifier, and so on.
> If the comparison is internal to the store, the families are always equal (unless familyLength is zero, for special purposes), so this step can be skipped for better performance.
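The optimization described above can be sketched on plain byte arrays rather than HBase `Cell`s. This is an illustration, not the HBase implementation: the class, the `byte[][]` cell layout, and the method names are made up; only the idea (inside one store all cells share a family, so a store-internal comparator can skip the family comparison) comes from the issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of HBASE-27788's idea on simplified {row, family, qualifier} keys.
public class InnerStoreCompare {
    static int compareBytes(byte[] a, byte[] b) { return Arrays.compare(a, b); }

    /** General comparison: row, then family, then qualifier. */
    static int compareFull(byte[][] left, byte[][] right) {
        int c = compareBytes(left[0], right[0]);
        if (c != 0) return c;
        c = compareBytes(left[1], right[1]);
        if (c != 0) return c;
        return compareBytes(left[2], right[2]);
    }

    /** Store-internal comparison: families are known equal, so skip them. */
    static int compareInnerStore(byte[][] left, byte[][] right) {
        int c = compareBytes(left[0], right[0]);
        if (c != 0) return c;
        return compareBytes(left[2], right[2]);
    }

    static byte[][] cell(String row, String fam, String qual) {
        return new byte[][] {
            row.getBytes(StandardCharsets.UTF_8),
            fam.getBytes(StandardCharsets.UTF_8),
            qual.getBytes(StandardCharsets.UTF_8) };
    }

    public static void main(String[] args) {
        byte[][] a = cell("row1", "fam1", "q1");
        byte[][] b = cell("row1", "fam1", "q2");
        // With equal families both comparators agree; the inner-store one does less work.
        System.out.println(compareFull(a, b) == compareInnerStore(a, b));
    }
}
```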
[jira] [Resolved] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang resolved HBASE-27788.
Resolution: Fixed
[jira] [Updated] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang updated HBASE-27788:
Fix Version/s: 2.6.0
[jira] [Reopened] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang reopened HBASE-27788:
[jira] [Commented] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721148#comment-17721148 ]
Zheng Wang commented on HBASE-27788:
Oh, I forgot it; will do it later.
[jira] [Resolved] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang resolved HBASE-27788.
Fix Version/s: 3.0.0-alpha-4
Resolution: Fixed
[jira] [Commented] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721143#comment-17721143 ]
Zheng Wang commented on HBASE-27788:
Thanks for all the comments. [~zhangduo] [~bbeaudreault]
[jira] [Updated] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Wang updated HBASE-27788:
Attachment: BenchmarkForInnerStore.java
            BenchmarkForNormal.java
[jira] [Comment Edited] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712226#comment-17712226 ]
Zheng Wang edited comment on HBASE-27788 at 4/23/23 4:01 AM:
Test Env: linux5.4, jdk8, jmh1.36, 8c16g
Test Cmd: java -jar benchmarks.jar -i 5 -r 10 -wi 5 -w 10 -o result.out
Test Mode: throughput, the more the better

|Benchmark|(p1)|(p2)|Mode|Cnt|Score|Error|Units|Diff|
|BenchmarkForInnerStore.new_compareBBKV| | |thrpt|5|28025769|± 85837.894|ops/s|1.00%|
|BenchmarkForInnerStore.new_compareBBKV| |fam1|thrpt|5|45988795|± 743418.588|ops/s|14.00%|
|BenchmarkForInnerStore.new_compareBBKV|fam1| |thrpt|5|46169746|± 313848.117|ops/s|15.00%|
|BenchmarkForInnerStore.new_compareBBKV|fam1|fam1|thrpt|5|28340570|± 110743.597|ops/s|19.00%|
|BenchmarkForInnerStore.new_compareKV| | |thrpt|5|28555080|± 137117.752|ops/s|1.00%|
|BenchmarkForInnerStore.new_compareKV| |fam1|thrpt|5|48428310|± 457635.029|ops/s|12.00%|
|BenchmarkForInnerStore.new_compareKV|fam1| |thrpt|5|48493949|± 251767.842|ops/s|12.00%|
|BenchmarkForInnerStore.new_compareKV|fam1|fam1|thrpt|5|28550667|± 115741.387|ops/s|27.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV| | |thrpt|5|29217290|± 101649.947|ops/s|6.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV| |fam1|thrpt|5|46949029|± 215794.996|ops/s|7.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV|fam1| |thrpt|5|46946670|± 146710.467|ops/s|7.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV|fam1|fam1|thrpt|5|29148782|± 206963.662|ops/s|20.00%|
|BenchmarkForInnerStore.old_compareBBKV| | |thrpt|5|27675873|± 276983.891|ops/s| |
|BenchmarkForInnerStore.old_compareBBKV| |fam1|thrpt|5|40225985|± 333777.174|ops/s| |
|BenchmarkForInnerStore.old_compareBBKV|fam1| |thrpt|5|40187512|± 242635.903|ops/s| |
|BenchmarkForInnerStore.old_compareBBKV|fam1|fam1|thrpt|5|23719010|± 78500.923|ops/s| |
|BenchmarkForInnerStore.old_compareKV| | |thrpt|5|28263508|± 80403.361|ops/s| |
|BenchmarkForInnerStore.old_compareKV| |fam1|thrpt|5|43253529|± 227223.861|ops/s| |
|BenchmarkForInnerStore.old_compareKV|fam1| |thrpt|5|43251637|± 370669.972|ops/s| |
|BenchmarkForInnerStore.old_compareKV|fam1|fam1|thrpt|5|22556530|± 131922.278|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV| | |thrpt|5|27607601|± 181466.155|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV| |fam1|thrpt|5|43838946|± 147828.804|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV|fam1| |thrpt|5|43853799|± 159898.926|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV|fam1|fam1|thrpt|5|24349838|± 233577.807|ops/s| |
|BenchmarkForNormal.new_compareBBKV|fam|fam1|thrpt|5|35397671|± 87148.764|ops/s|-3.00%|
|BenchmarkForNormal.new_compareBBKV|fam1|fam|thrpt|5|34853758|± 181728.193|ops/s|-4.00%|
|BenchmarkForNormal.new_compareKV|fam|fam1|thrpt|5|36379792|± 103787.745|ops/s|0.00%|
|BenchmarkForNormal.new_compareKV|fam1|fam|thrpt|5|36389642|± 215220.231|ops/s|0.00%|
|BenchmarkForNormal.new_compareKVVsBBKV|fam|fam1|thrpt|5|39477917|± 116925.262|ops/s|0.00%|
|BenchmarkForNormal.new_compareKVVsBBKV|fam1|fam|thrpt|5|39381461|± 196771.635|ops/s|0.00%|
|BenchmarkForNormal.old_compareBBKV|fam|fam1|thrpt|5|36419133|± 159715.504|ops/s| |
|BenchmarkForNormal.old_compareBBKV|fam1|fam|thrpt|5|36422202|± 86247.635|ops/s| |
|BenchmarkForNormal.old_compareKV|fam|fam1|thrpt|5|36247387|± 95109.893|ops/s| |
|BenchmarkForNormal.old_compareKV|fam1|fam|thrpt|5|36260304|± 63266.840|ops/s| |
|BenchmarkForNormal.old_compareKVVsBBKV|fam|fam1|thrpt|5|39326233|± 218927.739|ops/s| |
|BenchmarkForNormal.old_compareKVVsBBKV|fam1|fam|thrpt|5|39297932|± 487026.618|ops/s| |
[jira] [Comment Edited] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712226#comment-17712226 ] Zheng Wang edited comment on HBASE-27788 at 4/22/23 3:34 AM:
-
Test Env: linux5.4, jdk8, jmh1.36, 8c16g
Test Cmd: java -jar benchmarks.jar -i 5 -r 10 -wi 5 -w 10 -o result.out
Test Mode: throughput, the more the better
|Benchmark|(p1)|(p2)|Mode|Cnt|Score|Error|Units|Diff|
|BenchmarkForInnerStore.new_compareBBKV| | |thrpt|5|28025769|± 85837.894|ops/s|1.00%|
|BenchmarkForInnerStore.new_compareBBKV| |fam1|thrpt|5|45988795|± 743418.588|ops/s|14.00%|
|BenchmarkForInnerStore.new_compareBBKV|fam1| |thrpt|5|46169746|± 313848.117|ops/s|15.00%|
|BenchmarkForInnerStore.new_compareBBKV|fam1|fam1|thrpt|5|28340570|± 110743.597|ops/s|19.00%|
|BenchmarkForInnerStore.new_compareKV| | |thrpt|5|28555080|± 137117.752|ops/s|1.00%|
|BenchmarkForInnerStore.new_compareKV| |fam1|thrpt|5|48428310|± 457635.029|ops/s|12.00%|
|BenchmarkForInnerStore.new_compareKV|fam1| |thrpt|5|48493949|± 251767.842|ops/s|12.00%|
|BenchmarkForInnerStore.new_compareKV|fam1|fam1|thrpt|5|28550667|± 115741.387|ops/s|27.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV| | |thrpt|5|29217290|± 101649.947|ops/s|6.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV| |fam1|thrpt|5|46949029|± 215794.996|ops/s|7.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV|fam1| |thrpt|5|46946670|± 146710.467|ops/s|7.00%|
|BenchmarkForInnerStore.new_compareKVVsBBKV|fam1|fam1|thrpt|5|29148782|± 206963.662|ops/s|20.00%|
|BenchmarkForInnerStore.old_compareBBKV| | |thrpt|5|27675873|± 276983.891|ops/s| |
|BenchmarkForInnerStore.old_compareBBKV| |fam1|thrpt|5|40225985|± 333777.174|ops/s| |
|BenchmarkForInnerStore.old_compareBBKV|fam1| |thrpt|5|40187512|± 242635.903|ops/s| |
|BenchmarkForInnerStore.old_compareBBKV|fam1|fam1|thrpt|5|23719010|± 78500.923|ops/s| |
|BenchmarkForInnerStore.old_compareKV| | |thrpt|5|28263508|± 80403.361|ops/s| |
|BenchmarkForInnerStore.old_compareKV| |fam1|thrpt|5|43253529|± 227223.861|ops/s| |
|BenchmarkForInnerStore.old_compareKV|fam1| |thrpt|5|43251637|± 370669.972|ops/s| |
|BenchmarkForInnerStore.old_compareKV|fam1|fam1|thrpt|5|22556530|± 131922.278|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV| | |thrpt|5|27607601|± 181466.155|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV| |fam1|thrpt|5|43838946|± 147828.804|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV|fam1| |thrpt|5|43853799|± 159898.926|ops/s| |
|BenchmarkForInnerStore.old_compareKVVsBBKV|fam1|fam1|thrpt|5|24349838|± 233577.807|ops/s| |
|BenchmarkForNormal.new_compareBBKV|fam|fam|thrpt|5|24232184|± 119228.533|ops/s|-1.00%|
|BenchmarkForNormal.new_compareBBKV|fam|fam1|thrpt|5|35397671|± 87148.764|ops/s|-3.00%|
|BenchmarkForNormal.new_compareBBKV|fam1|fam|thrpt|5|34853758|± 181728.193|ops/s|-4.00%|
|BenchmarkForNormal.new_compareBBKV|fam1|fam1|thrpt|5|23348288|± 210662.654|ops/s|-2.00%|
|BenchmarkForNormal.new_compareKV|fam|fam|thrpt|5|23545532|± 300722.638|ops/s|0.00%|
|BenchmarkForNormal.new_compareKV|fam|fam1|thrpt|5|36379792|± 103787.745|ops/s|0.00%|
|BenchmarkForNormal.new_compareKV|fam1|fam|thrpt|5|36389642|± 215220.231|ops/s|0.00%|
|BenchmarkForNormal.new_compareKV|fam1|fam1|thrpt|5|22781448|± 334278.380|ops/s|1.00%|
|BenchmarkForNormal.new_compareKVVsBBKV|fam|fam|thrpt|5|24419066|± 178926.313|ops/s|-3.00%|
|BenchmarkForNormal.new_compareKVVsBBKV|fam|fam1|thrpt|5|39477917|± 116925.262|ops/s|0.00%|
|BenchmarkForNormal.new_compareKVVsBBKV|fam1|fam|thrpt|5|39381461|± 196771.635|ops/s|0.00%|
|BenchmarkForNormal.new_compareKVVsBBKV|fam1|fam1|thrpt|5|23624400|± 402220.882|ops/s|-3.00%|
|BenchmarkForNormal.old_compareBBKV|fam|fam|thrpt|5|24485218|± 95396.313|ops/s| |
|BenchmarkForNormal.old_compareBBKV|fam|fam1|thrpt|5|36419133|± 159715.504|ops/s| |
|BenchmarkForNormal.old_compareBBKV|fam1|fam|thrpt|5|36422202|± 86247.635|ops/s| |
|BenchmarkForNormal.old_compareBBKV|fam1|fam1|thrpt|5|23734773|± 210072.328|ops/s| |
|BenchmarkForNormal.old_compareKV|fam|fam|thrpt|5|23534333|± 57022.884|ops/s| |
|BenchmarkForNormal.old_compareKV|fam|fam1|thrpt|5|36247387|± 95109.893|ops/s| |
|BenchmarkForNormal.old_compareKV|fam1|fam|thrpt|5|36260304|± 63266.840|ops/s| |
|BenchmarkForNormal.old_compareKV|fam1|fam1|thrpt|5|22582939|± 50450.874|ops/s| |
|BenchmarkForNormal.old_compareKVVsBBKV|fam|fam|thrpt|5|25144704|± 291029.655|ops/s| |
|BenchmarkForNormal.old_compareKVVsBBKV|fam|fam1|thrpt|5|39326233|± 218927.739|ops/s| |
|BenchmarkForNormal.old_compareKVVsBBKV|fam1|fam|thrpt|5|39297932|± 487026.618|ops/s| |
|BenchmarkForNormal.old_compareKVVsBBKV|fam1|fam1|thrpt|5|24395274|± 149132.118|ops/s| |
was (Author: filtertip): Perf test report, copied from PR.(see PerfTestCellComparator.java, set compareCnt as 1 billion)
|compareMethod|leftFamLen|rightFamLen|comparator|cost(ms)|diff|
|compareKV|0|0|CellComparatorImpl|28850| |
[jira] [Commented] (HBASE-27805) The chunk created by mslab may cause memory fragement and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714630#comment-17714630 ] Zheng Wang commented on HBASE-27805: Ok. [~zhangduo] > The chunk created by mslab may cause memory fragement and lead to fullgc > > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, > these chunks are allocated as humongous objects, exclusively allocating one > region, then the remaining 2m become memory fragement. > Lots of memory fragement may lead to fullgc even if the percent of used heap > not high enough. > I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half > of heapRegionSize), there was no repeat of the above. > BTW, in G1, humongous objects are objects larger or equal the size of half a > region, and the heapRegionSize is automatically calculated based on the heap > size parameter if not explicitly specified. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
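The arithmetic behind the comment above can be sketched as follows. This is an illustrative check, not HBase code: `HumongousCheck` and its method names are made up for this sketch; the only facts it encodes are from the issue itself (G1 treats an object of at least half a region as humongous, the default MSLAB chunk is 2m, and the proposed workaround is 2047k).

```java
// Sketch: why a 2m MSLAB chunk is a G1 humongous object when
// heapRegionSize is 4m, while a 2047k chunk is not.
public class HumongousCheck {
    // G1 rule cited in the issue: objects >= half a region are humongous
    // and occupy whole regions exclusively.
    static boolean isHumongous(long chunkBytes, long heapRegionBytes) {
        return chunkBytes >= heapRegionBytes / 2;
    }

    public static void main(String[] args) {
        long regionSize = 4L * 1024 * 1024;   // assume -XX:G1HeapRegionSize=4m
        long defaultChunk = 2L * 1024 * 1024; // default chunk size, 2m
        long reducedChunk = 2047L * 1024;     // 2m - 1k, just under half a region

        // 2m chunk in a 4m region: humongous, wastes the remaining 2m of the region.
        System.out.println(isHumongous(defaultChunk, regionSize)); // true
        // 2047k chunk: below the threshold, allocated as a normal object.
        System.out.println(isHumongous(reducedChunk, regionSize)); // false
    }
}
```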
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragement and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27805: --- Description: The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2m become memory fragement. Lots of memory fragement may lead to fullgc even if the percent of used heap not high enough. I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half of heapRegionSize), there was no repeat of the above. BTW, in G1, humongous objects are objects larger or equal the size of half a region, and the heapRegionSize is automatically calculated based on the heap size parameter if not explicitly specified. was: The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2m become memory fragement. Lots of memory fragement may lead to fullgc even if the percent of used heap not high enough. I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half of heapRegionSize), there was no repeat of the above. BTW, in g1, humongous objects are objects larger or equal the size of half a region. > The chunk created by mslab may cause memory fragement and lead to fullgc > > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, > these chunks are allocated as humongous objects, exclusively allocating one > region, then the remaining 2m become memory fragement. > Lots of memory fragement may lead to fullgc even if the percent of used heap > not high enough. 
> I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half > of heapRegionSize), there was no repeat of the above. > BTW, in G1, humongous objects are objects larger or equal the size of half a > region, and the heapRegionSize is automatically calculated based on the heap > size parameter if not explicitly specified. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HBASE-27805) The chunk created by mslab may cause memory fragement and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714381#comment-17714381 ] Zheng Wang commented on HBASE-27805: Not sure if we should change the default chunk size to 2047k, any suggestions are welcomed. > The chunk created by mslab may cause memory fragement and lead to fullgc > > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, > these chunks are allocated as humongous objects, exclusively allocating one > region, then the remaining 2m become memory fragement. > Lots of memory fragement may lead to fullgc even if the percent of used heap > not high enough. > I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half > of heapRegionSize), there was no repeat of the above. > BTW, in g1, humongous objects are objects larger or equal the size of half a > region. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragement and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27805: --- Description: The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2m become memory fragement. Lots of memory fragement may lead to fullgc even if the percent of used heap not high enough. I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half of heapRegionSize), there was no repeat of the above. BTW, in g1, humongous objects are objects larger or equal the size of half a region. was: The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2m become memory fragement. Lots of memory fragement may leading to fullgc even if the percent of used heap not high enough. I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half of heapRegionSize), there was no repeat of the above. BTW, in g1, humongous objects are objects larger or equal the size of half a region. > The chunk created by mslab may cause memory fragement and lead to fullgc > > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, > these chunks are allocated as humongous objects, exclusively allocating one > region, then the remaining 2m become memory fragement. > Lots of memory fragement may lead to fullgc even if the percent of used heap > not high enough. 
> I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half > of heapRegionSize), there was no repeat of the above. > BTW, in g1, humongous objects are objects larger or equal the size of half a > region. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragement and lead to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27805: --- Summary: The chunk created by mslab may cause memory fragement and lead to fullgc (was: The chunk created by mslab may cause memory fragement and leading to fullgc) > The chunk created by mslab may cause memory fragement and lead to fullgc > > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, > these chunks are allocated as humongous objects, exclusively allocating one > region, then the remaining 2m become memory fragement. > Lots of memory fragement may leading to fullgc even if the percent of used > heap not high enough. > I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half > of heapRegionSize), there was no repeat of the above. > BTW, in g1, humongous objects are objects larger or equal the size of half a > region. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragement and leading to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27805: --- Description: The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2m become memory fragement. Lots of memory fragement may leading to fullgc even if the percent of used heap not high enough. I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half of heapRegionSize), there was no repeat of the above. BTW, in g1, humongous objects are objects larger or equal the size of half a region. was: The default size of chunk is 2MB, when we use G1, if heapRegionSize equals 4MB, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2MB become memory fragement. Lots of memory fragement may leading to fullgc even if the percent of used heap not high enough. I have tested to reduce the chunk size to 2047k, there was no repeat of the above. > The chunk created by mslab may cause memory fragement and leading to fullgc > --- > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2m, when we use G1, if heapRegionSize equals 4m, > these chunks are allocated as humongous objects, exclusively allocating one > region, then the remaining 2m become memory fragement. > Lots of memory fragement may leading to fullgc even if the percent of used > heap not high enough. > I have tested to reduce the chunk size to 2047k(2m-1k, a bit lesser than half > of heapRegionSize), there was no repeat of the above. > BTW, in g1, humongous objects are objects larger or equal the size of half a > region. 
> -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragement and leading to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27805: --- Description: The default size of chunk is 2MB, when we use G1, if heapRegionSize equals 4MB, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2MB become memory fragement. Lots of memory fragement may leading to fullgc even if the percent of used heap not high enough. I have tested to reduce the chunk size to 2047k, there was no repeat of the above. was: The default size of chunk is 2MB, when we use G1, if heapRegionSize equals 4MB, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2MB become memory fragement. Lots of memory fragement may leading to fullgc even if the percent of used heap not high enough. > The chunk created by mslab may cause memory fragement and leading to fullgc > --- > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2MB, when we use G1, if heapRegionSize equals > 4MB, these chunks are allocated as humongous objects, exclusively allocating > one region, then the remaining 2MB become memory fragement. > Lots of memory fragement may leading to fullgc even if the percent of used > heap not high enough. > I have tested to reduce the chunk size to 2047k, there was no repeat of the > above. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragement and leading to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27805: --- Attachment: chunksize-2047k.png > The chunk created by mslab may cause memory fragement and leading to fullgc > --- > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2047k.png, chunksize-2048k-fullgc.png > > > The default size of chunk is 2MB, when we use G1, if heapRegionSize equals > 4MB, these chunks are allocated as humongous objects, exclusively allocating > one region, then the remaining 2MB become memory fragement. > Lots of memory fragement may leading to fullgc even if the percent of used > heap not high enough. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27805) The chunk created by mslab may cause memory fragement and leading to fullgc
[ https://issues.apache.org/jira/browse/HBASE-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27805: --- Attachment: chunksize-2048k-fullgc.png > The chunk created by mslab may cause memory fragement and leading to fullgc > --- > > Key: HBASE-27805 > URL: https://issues.apache.org/jira/browse/HBASE-27805 > Project: HBase > Issue Type: Improvement > Components: regionserver >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: chunksize-2048k-fullgc.png > > > The default size of chunk is 2MB, when we use G1, if heapRegionSize equals > 4MB, these chunks are allocated as humongous objects, exclusively allocating > one region, then the remaining 2MB become memory fragement. > Lots of memory fragement may leading to fullgc even if the percent of used > heap not high enough. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27805) The chunk created by mslab may cause memory fragement and leading to fullgc
Zheng Wang created HBASE-27805: -- Summary: The chunk created by mslab may cause memory fragement and leading to fullgc Key: HBASE-27805 URL: https://issues.apache.org/jira/browse/HBASE-27805 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Zheng Wang Assignee: Zheng Wang The default size of chunk is 2MB, when we use G1, if heapRegionSize equals 4MB, these chunks are allocated as humongous objects, exclusively allocating one region, then the remaining 2MB become memory fragement. Lots of memory fragement may leading to fullgc even if the percent of used heap not high enough. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712587#comment-17712587 ] Zheng Wang commented on HBASE-27788: A writing test case with PE: nohup hbase pe --table=TestTable1 --nomapred --oneCon=true --valueSize=10 --rows=100 --columns=200 --autoFlush=true --presplit=10 --multiPut=100 --writeToWAL=false sequentialWrite 1 2>&1 > nohup.out & //before patch Finished TestClient-0 in 396657ms over 100 rows //after patch Finished TestClient-0 in 356932ms over 100 rows > Skip family comparing when compare cells inner the store > > > Key: HBASE-27788 > URL: https://issues.apache.org/jira/browse/HBASE-27788 > Project: HBase > Issue Type: Improvement > Components: Performance >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Currently we use CellComparatorImpl to compare cells, it compare row first, > then family, then qulifier and so on. > If the comparing inner the store, the families are always equal(unless the > familyLength is zero for special purpose), so this step could be skipped for > better performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712226#comment-17712226 ] Zheng Wang edited comment on HBASE-27788 at 4/14/23 7:50 AM:
-
Perf test report, copied from PR.(see PerfTestCellComparator.java, set compareCnt as 1 billion)
|compareMethod|leftFamLen|rightFamLen|comparator|cost(ms)|diff|
|compareKV|0|0|CellComparatorImpl|28850| |
|compareKV|0|0|InnerStoreCellComparator|27478|-5.00%|
|compareKV|0|4|CellComparatorImpl|19041| |
|compareKV|0|4|InnerStoreCellComparator|17391|-9.00%|
|compareKV|4|0|CellComparatorImpl|18988| |
|compareKV|4|0|InnerStoreCellComparator|17375|-8.00%|
|compareKV|4|4|CellComparatorImpl|33360| |
|compareKV|4|4|InnerStoreCellComparator|27083|-19.00%|
|compareBBKV|0|0|CellComparatorImpl|34014| |
|compareBBKV|0|0|InnerStoreCellComparator|31660|-7.00%|
|compareBBKV|0|4|CellComparatorImpl|20780| |
|compareBBKV|0|4|InnerStoreCellComparator|20847|0.00%|
|compareBBKV|4|0|CellComparatorImpl|23540| |
|compareBBKV|4|0|InnerStoreCellComparator|21751|-8.00%|
|compareBBKV|4|4|CellComparatorImpl|40192| |
|compareBBKV|4|4|InnerStoreCellComparator|31522|-22.00%|
|compareKVVsBBKV|0|0|CellComparatorImpl|30979| |
|compareKVVsBBKV|0|0|InnerStoreCellComparator|29827|-4.00%|
|compareKVVsBBKV|0|4|CellComparatorImpl|21918| |
|compareKVVsBBKV|0|4|InnerStoreCellComparator|19143|-13.00%|
|compareKVVsBBKV|4|0|CellComparatorImpl|22605| |
|compareKVVsBBKV|4|0|InnerStoreCellComparator|20952|-7.00%|
|compareKVVsBBKV|4|4|CellComparatorImpl|35561| |
|compareKVVsBBKV|4|4|InnerStoreCellComparator|29150|-18.00%|
was (Author: filtertip): Perf test report(set compareCnt as 1 billion), copied from PR.
|compareMethod|leftFamLen|rightFamLen|comparator|cost(ms)|diff|
|compareKV|0|0|CellComparatorImpl|28850| |
|compareKV|0|0|InnerStoreCellComparator|27478|-5.00%|
|compareKV|0|4|CellComparatorImpl|19041| |
|compareKV|0|4|InnerStoreCellComparator|17391|-9.00%|
|compareKV|4|0|CellComparatorImpl|18988| |
|compareKV|4|0|InnerStoreCellComparator|17375|-8.00%|
|compareKV|4|4|CellComparatorImpl|33360| |
|compareKV|4|4|InnerStoreCellComparator|27083|-19.00%|
|compareBBKV|0|0|CellComparatorImpl|34014| |
|compareBBKV|0|0|InnerStoreCellComparator|31660|-7.00%|
|compareBBKV|0|4|CellComparatorImpl|20780| |
|compareBBKV|0|4|InnerStoreCellComparator|20847|0.00%|
|compareBBKV|4|0|CellComparatorImpl|23540| |
|compareBBKV|4|0|InnerStoreCellComparator|21751|-8.00%|
|compareBBKV|4|4|CellComparatorImpl|40192| |
|compareBBKV|4|4|InnerStoreCellComparator|31522|-22.00%|
|compareKVVsBBKV|0|0|CellComparatorImpl|30979| |
|compareKVVsBBKV|0|0|InnerStoreCellComparator|29827|-4.00%|
|compareKVVsBBKV|0|4|CellComparatorImpl|21918| |
|compareKVVsBBKV|0|4|InnerStoreCellComparator|19143|-13.00%|
|compareKVVsBBKV|4|0|CellComparatorImpl|22605| |
|compareKVVsBBKV|4|0|InnerStoreCellComparator|20952|-7.00%|
|compareKVVsBBKV|4|4|CellComparatorImpl|35561| |
|compareKVVsBBKV|4|4|InnerStoreCellComparator|29150|-18.00%|
> Skip family comparing when compare cells inner the store > > > Key: HBASE-27788 > URL: https://issues.apache.org/jira/browse/HBASE-27788 > Project: HBase > Issue Type: Improvement > Components: Performance >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Currently we use CellComparatorImpl to compare cells, it compare row first, > then family, then qulifier and so on. > If the comparing inner the store, the families are always equal(unless the > familyLength is zero for special purpose), so this step could be skipped for > better performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HBASE-27788) Skip family comparing when compare cells inner the store
[ https://issues.apache.org/jira/browse/HBASE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712226#comment-17712226 ] Zheng Wang commented on HBASE-27788:
Perf test report(set compareCnt as 1 billion), copied from PR.
|compareMethod|leftFamLen|rightFamLen|comparator|cost(ms)|diff|
|compareKV|0|0|CellComparatorImpl|28850| |
|compareKV|0|0|InnerStoreCellComparator|27478|-5.00%|
|compareKV|0|4|CellComparatorImpl|19041| |
|compareKV|0|4|InnerStoreCellComparator|17391|-9.00%|
|compareKV|4|0|CellComparatorImpl|18988| |
|compareKV|4|0|InnerStoreCellComparator|17375|-8.00%|
|compareKV|4|4|CellComparatorImpl|33360| |
|compareKV|4|4|InnerStoreCellComparator|27083|-19.00%|
|compareBBKV|0|0|CellComparatorImpl|34014| |
|compareBBKV|0|0|InnerStoreCellComparator|31660|-7.00%|
|compareBBKV|0|4|CellComparatorImpl|20780| |
|compareBBKV|0|4|InnerStoreCellComparator|20847|0.00%|
|compareBBKV|4|0|CellComparatorImpl|23540| |
|compareBBKV|4|0|InnerStoreCellComparator|21751|-8.00%|
|compareBBKV|4|4|CellComparatorImpl|40192| |
|compareBBKV|4|4|InnerStoreCellComparator|31522|-22.00%|
|compareKVVsBBKV|0|0|CellComparatorImpl|30979| |
|compareKVVsBBKV|0|0|InnerStoreCellComparator|29827|-4.00%|
|compareKVVsBBKV|0|4|CellComparatorImpl|21918| |
|compareKVVsBBKV|0|4|InnerStoreCellComparator|19143|-13.00%|
|compareKVVsBBKV|4|0|CellComparatorImpl|22605| |
|compareKVVsBBKV|4|0|InnerStoreCellComparator|20952|-7.00%|
|compareKVVsBBKV|4|4|CellComparatorImpl|35561| |
|compareKVVsBBKV|4|4|InnerStoreCellComparator|29150|-18.00%|
> Skip family comparing when compare cells inner the store > > > Key: HBASE-27788 > URL: https://issues.apache.org/jira/browse/HBASE-27788 > Project: HBase > Issue Type: Improvement > Components: Performance >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Currently we use CellComparatorImpl to compare cells, it compare row first, > then family, then qulifier and so on. 
> If the comparing inner the store, the families are always equal(unless the > familyLength is zero for special purpose), so this step could be skipped for > better performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27788) Skip family comparing when compare cells inner the store
Zheng Wang created HBASE-27788: -- Summary: Skip family comparing when compare cells inner the store Key: HBASE-27788 URL: https://issues.apache.org/jira/browse/HBASE-27788 Project: HBase Issue Type: Improvement Components: Performance Reporter: Zheng Wang Assignee: Zheng Wang Currently we use CellComparatorImpl to compare cells, it compare row first, then family, then qulifier and so on. If the comparing inner the store, the families are always equal(unless the familyLength is zero for special purpose), so this step could be skipped for better performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
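The optimization described above can be sketched with a toy comparator. This is illustrative only: `ToyCell` and the method names are invented for this sketch and simplify away the real HBase `Cell`/`CellComparatorImpl` API; it only shows the ordering idea (row, then family, then qualifier) and why the family step can be dropped when every cell in a store shares the same family.

```java
import java.util.Arrays;

// Sketch of skipping the family comparison for inner-store cell ordering.
public class InnerStoreCompareSketch {
    // A toy cell: row, family, and qualifier as separate byte arrays.
    record ToyCell(byte[] row, byte[] family, byte[] qualifier) {}

    // General-purpose order: row first, then family, then qualifier.
    static int compareGeneral(ToyCell a, ToyCell b) {
        int d = Arrays.compare(a.row(), b.row());
        if (d != 0) return d;
        d = Arrays.compare(a.family(), b.family()); // always 0 inside one store
        if (d != 0) return d;
        return Arrays.compare(a.qualifier(), b.qualifier());
    }

    // Inner-store order: families are known equal, so go straight to qualifier.
    static int compareInnerStore(ToyCell a, ToyCell b) {
        int d = Arrays.compare(a.row(), b.row());
        if (d != 0) return d;
        return Arrays.compare(a.qualifier(), b.qualifier());
    }
}
```

Within one store the two comparators agree on every pair, so dropping the family step changes no ordering while saving a byte-range comparison per call, which is where the single-digit to ~20% gains in the perf tables come from.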
[jira] [Commented] (HBASE-27765) Add biggest cell related info into web ui
[ https://issues.apache.org/jira/browse/HBASE-27765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17708982#comment-17708982 ] Zheng Wang commented on HBASE-27765: Filled in the release note. Thanks a lot for the review and push. [~zhangduo] > Add biggest cell related info into web ui > - > > Key: HBASE-27765 > URL: https://issues.apache.org/jira/browse/HBASE-27765 > Project: HBase > Issue Type: Improvement > Components: HFile, UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.6.0, 3.0.0-alpha-4 > > Attachments: screenshot-1.png, screenshot-2.png > > > There are some disadvantages to large cell, such as can't be cached or cause > memory fragmentation, but currently user can't easily to find them out. > My proposal is save len and key of the biggest cell into fileinfo of hfile, > and shown on web ui, including two places. > 1: Add "Len Of Biggest Cell" into main page of regionServer, in here we can > find out which regions has large cell by sorting. > 2: Add "Len Of Biggest Cell" and "Key Of Biggest Cell" into region page, in > here we can find out the exactly key and the hfile. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27765) Add biggest cell related info into web ui
[ https://issues.apache.org/jira/browse/HBASE-27765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27765: --- Release Note: Save len and key of the biggest cell into fileinfo when generate hfile, and shows them on webui for better monitor. > Add biggest cell related info into web ui > - > > Key: HBASE-27765 > URL: https://issues.apache.org/jira/browse/HBASE-27765 > Project: HBase > Issue Type: Improvement > Components: HFile, UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.6.0, 3.0.0-alpha-4 > > Attachments: screenshot-1.png, screenshot-2.png > > > There are some disadvantages to large cell, such as can't be cached or cause > memory fragmentation, but currently user can't easily to find them out. > My proposal is save len and key of the biggest cell into fileinfo of hfile, > and shown on web ui, including two places. > 1: Add "Len Of Biggest Cell" into main page of regionServer, in here we can > find out which regions has large cell by sorting. > 2: Add "Len Of Biggest Cell" and "Key Of Biggest Cell" into region page, in > here we can find out the exactly key and the hfile. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27765) Add biggest cell related info into web ui
[ https://issues.apache.org/jira/browse/HBASE-27765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27765: --- Fix Version/s: 3.0.0-alpha-4 > Add biggest cell related info into web ui > - > > Key: HBASE-27765 > URL: https://issues.apache.org/jira/browse/HBASE-27765 > Project: HBase > Issue Type: Improvement > Components: HFile, UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 3.0.0-alpha-4 > > Attachments: screenshot-1.png, screenshot-2.png > > > There are some disadvantages to large cell, such as can't be cached or cause > memory fragmentation, but currently user can't easily to find them out. > My proposal is save len and key of the biggest cell into fileinfo of hfile, > and shown on web ui, including two places. > 1: Add "Len Of Biggest Cell" into main page of regionServer, in here we can > find out which regions has large cell by sorting. > 2: Add "Len Of Biggest Cell" and "Key Of Biggest Cell" into region page, in > here we can find out the exactly key and the hfile. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27765) Add biggest cell related info into web ui
[ https://issues.apache.org/jira/browse/HBASE-27765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27765: --- Summary: Add biggest cell related info into web ui (was: Add biggest cell related info web ui) > Add biggest cell related info into web ui > - > > Key: HBASE-27765 > URL: https://issues.apache.org/jira/browse/HBASE-27765 > Project: HBase > Issue Type: Improvement > Components: HFile, UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > There are some disadvantages to large cell, such as can't be cached or cause > memory fragmentation, but currently user can't easily to find them out. > My proposal is save len and key of the biggest cell into fileinfo of hfile, > and shown on web ui, including two places. > 1: Add "Len Of Biggest Cell" into main page of regionServer, in here we can > find out which regions has large cell by sorting. > 2: Add "Len Of Biggest Cell" and "Key Of Biggest Cell" into region page, in > here we can find out the exactly key and the hfile. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27765) Add biggest cell related info web ui
[ https://issues.apache.org/jira/browse/HBASE-27765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27765: --- Attachment: screenshot-2.png > Add biggest cell related info web ui > > > Key: HBASE-27765 > URL: https://issues.apache.org/jira/browse/HBASE-27765 > Project: HBase > Issue Type: Improvement > Components: HFile, UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > There are some disadvantages to large cell, such as can't be cached or cause > memory fragmentation, but currently user can't easily to find them out. > My proposal is save len and key of the biggest cell into fileinfo of hfile, > and shown on web ui, including two places. > 1: Add "Len Of Biggest Cell" into main page of regionServer, in here we can > find out which regions has large cell by sorting. > 2: Add "Len Of Biggest Cell" and "Key Of Biggest Cell" into region page, in > here we can find out the exactly key and the hfile. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HBASE-27765) Add biggest cell related info web ui
[ https://issues.apache.org/jira/browse/HBASE-27765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-27765: --- Attachment: screenshot-1.png > Add biggest cell related info web ui > > > Key: HBASE-27765 > URL: https://issues.apache.org/jira/browse/HBASE-27765 > Project: HBase > Issue Type: Improvement > Components: HFile, UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Large cells have some disadvantages, such as that they can't be cached and may > cause memory fragmentation, but currently users can't easily find them. > My proposal is to save the length and key of the biggest cell into the fileinfo of the hfile, > and show them on the web UI in two places. > 1: Add "Len Of Biggest Cell" to the main page of the regionServer; here we can > find out which regions have large cells by sorting. > 2: Add "Len Of Biggest Cell" and "Key Of Biggest Cell" to the region page; here > we can find the exact key and the hfile. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-27765) Add biggest cell related info web ui
Zheng Wang created HBASE-27765: -- Summary: Add biggest cell related info web ui Key: HBASE-27765 URL: https://issues.apache.org/jira/browse/HBASE-27765 Project: HBase Issue Type: Improvement Components: HFile, UI Reporter: Zheng Wang Assignee: Zheng Wang Large cells have some disadvantages, such as that they can't be cached and may cause memory fragmentation, but currently users can't easily find them. My proposal is to save the length and key of the biggest cell into the fileinfo of the hfile and show them on the web UI in two places. 1: Add "Len Of Biggest Cell" to the main page of the regionServer; here we can find out which regions have large cells by sorting. 2: Add "Len Of Biggest Cell" and "Key Of Biggest Cell" to the region page; here we can find the exact key and the hfile. -- This message was sent by Atlassian Jira (v8.20.10#820010)
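The proposal above, remembering the single biggest cell seen while an hfile is written and recording its length and key in the file info at close time, can be sketched as follows. This is an illustrative sketch, not HBase's actual implementation; the class, method, and file-info key names (BiggestCellTracker, onAppend, LEN_OF_BIGGEST_CELL, KEY_OF_BIGGEST_CELL) are assumptions made for this example:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: while cells are appended to an hfile writer, keep
// track of the largest one, then emit its length and key as file-info
// entries when the file is closed. All names here are illustrative.
public class BiggestCellTracker {
    private long biggestLen = -1;
    private String biggestKey = null;

    // Called once per appended cell with its serialized length.
    public void onAppend(String key, long cellLen) {
        if (cellLen > biggestLen) {
            biggestLen = cellLen;
            biggestKey = key;
        }
    }

    // Entries that would be merged into the hfile's file-info map.
    public Map<String, byte[]> toFileInfo() {
        Map<String, byte[]> fileInfo = new HashMap<>();
        fileInfo.put("LEN_OF_BIGGEST_CELL",
            Long.toString(biggestLen).getBytes(StandardCharsets.UTF_8));
        fileInfo.put("KEY_OF_BIGGEST_CELL",
            biggestKey.getBytes(StandardCharsets.UTF_8));
        return fileInfo;
    }

    public static void main(String[] args) {
        BiggestCellTracker tracker = new BiggestCellTracker();
        tracker.onAppend("row1/cf:a", 120);
        tracker.onAppend("row2/cf:b", 4096);
        tracker.onAppend("row3/cf:c", 512);
        Map<String, byte[]> info = tracker.toFileInfo();
        System.out.println(new String(info.get("LEN_OF_BIGGEST_CELL"), StandardCharsets.UTF_8));
        System.out.println(new String(info.get("KEY_OF_BIGGEST_CELL"), StandardCharsets.UTF_8));
    }
}
```

Once the length is in the file info, the regionserver UI can aggregate it per region (max over a region's hfiles) for the sortable column described in point 1.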
[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer
[ https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533612#comment-17533612 ] Zheng Wang commented on HBASE-25768: [~Xiaolin Ha] Yeah, you are right, this patch is useful for some cases. > Support an overall coarse and fast balance strategy for StochasticLoadBalancer > -- > > Key: HBASE-25768 > URL: https://issues.apache.org/jira/browse/HBASE-25768 > Project: HBase > Issue Type: Improvement > Components: Balancer >Affects Versions: 3.0.0-alpha-1, 2.0.0, 1.4.13 >Reporter: Xiaolin Ha >Assignee: Xiaolin Ha >Priority: Major > > When we use StochasticLoadBalancer + balanceByTable, we could face two > difficulties. > # For each table, their regions are distributed uniformly, but for the > overall cluster, still exiting imbalance between RSes; > # When there are large-scaled restart of RSes, or expansion for groups or > cluster, we hope the balancer can execute as soon as possible, but the > StochasticLoadBalancer may need a lot of time to compute costs. > We can detect these circumstances in StochasticLoadBalancer(such as using the > percentage of skew tables), and before the normal balance steps trying, we > can add a strategy to let it just balance like the SimpleLoadBalancer or use > few light cost functions here. > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- External issue URL: (was: https://issues.apache.org/jira/browse/HBASE-8665) > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will not be done until the compaction running, it brings side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger unexpected alarm. > My approach is limit the compaction queue count, by compute the > filesNotCompating and hbase.hstore.compaction.max. -- This message was sent by Atlassian Jira (v8.20.7#820007)
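The approach described above, limiting the compaction queue count from the number of files not already compacting and hbase.hstore.compaction.max (the maximum number of files one compaction may select), could look roughly like this. A hedged sketch only; the method names maxQueuedCompactions and shouldEnqueue are invented for illustration and do not reflect the actual patch:

```java
// Illustrative sketch of capping queued compaction requests per store:
// if N files are eligible and each compaction can pick at most M of them,
// then more than ceil(N / M) queued requests cannot all do useful work.
public class CompactionQueueCap {
    // compactionMax corresponds to hbase.hstore.compaction.max (default 10).
    static int maxQueuedCompactions(int filesNotCompacting, int compactionMax) {
        if (filesNotCompacting <= 0) {
            return 0;
        }
        // Integer ceiling division: ceil(filesNotCompacting / compactionMax).
        return (filesNotCompacting + compactionMax - 1) / compactionMax;
    }

    static boolean shouldEnqueue(int alreadyQueued, int filesNotCompacting, int compactionMax) {
        return alreadyQueued < maxQueuedCompactions(filesNotCompacting, compactionMax);
    }

    public static void main(String[] args) {
        // Roughly the scenario from the report: ~2000 hfiles, compactionMax = 10.
        System.out.println(maxQueuedCompactions(2000, 10));
        System.out.println(shouldEnqueue(199, 2000, 10));
        System.out.println(shouldEnqueue(200, 2000, 10));
    }
}
```

Under this cap, the reported scenario (~2000 hfiles, hbase.hstore.compaction.max = 10) would queue at most 200 requests, well below the observed 1391.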
[jira] [Updated] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- External issue URL: https://issues.apache.org/jira/browse/HBASE-8665 > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will not be done until the compaction running, it brings side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger unexpected alarm. > My approach is limit the compaction queue count, by compute the > filesNotCompating and hbase.hstore.compaction.max. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531999#comment-17531999 ] Zheng Wang commented on HBASE-26987: BTW, this issue seems to have been introduced by HBASE-8665. > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compactions, we set selectNow to false, so file > selection is not done until the compaction runs. This brings a side > effect: if another compaction is slow, we may put lots of compactions on the > queue, because the filesCompacting of the HStore is empty in the meantime. > An example is shown in the attachments: there are 154 regions and about 2000 hfiles, > but the length of the compact queue grows to 1391, which causes confusion and may > trigger unexpected alarms. > My approach is to limit the compaction queue count by computing > filesNotCompacting and hbase.hstore.compaction.max. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987 ] Zheng Wang deleted comment on HBASE-26987: was (Author: filtertip): Before this issue resolved, i think we should have a config to disable this feature. > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will not be done until the compaction running, it brings side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger unexpected alarm. > My approach is limit the compaction queue count, by compute the > filesNotCompating and hbase.hstore.compaction.max. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- Description: For some system compaction, we set the selectNow to false, so the file selecting will not be done until the compaction running, it brings side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. An example shows at attachments, there are 154 regions and about 2000 hfiles, but the length of compact queue grows to 1391, it cause confusion and may trigger unexpected alarm. My approach is limit the compaction queue count, by compute the filesNotCompating and hbase.hstore.compaction.max. was: For some system compaction, we set the selectNow to false, so the file selecting will not be done until the compaction running, it brings side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. An example shows at attachments, there are 154 regions and about 2000 hfiles, but the length of compact queue grows to 1391, it cause confusion and may trigger unexpected alarm. > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will not be done until the compaction running, it brings side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. 
> An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger unexpected alarm. > My approach is limit the compaction queue count, by compute the > filesNotCompating and hbase.hstore.compaction.max. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang reassigned HBASE-26987: -- Assignee: Zheng Wang > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will not be done until the compaction running, it brings side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger unexpected alarm. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- Description: For some system compaction, we set the selectNow to false, so the file selecting will not be done until the compaction running, it brings side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. An example shows at attachments, there are 154 regions and about 2000 hfiles, but the length of compact queue grows to 1391, it cause confusion and may trigger unexpected alarm. was: For some system compaction, we set the selectNow to false, so the file selecting will not be done until the compaction running, it brings side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. An example shows at attachments, there are 154 regions and about 2000 hfiles, but the length of compact queue grows to 1391, it cause confusion and may trigger wrong alarm. > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will not be done until the compaction running, it brings side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger unexpected alarm. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- Description: For some system compaction, we set the selectNow to false, so the file selecting will not be done until the compaction running, it brings side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. An example shows at attachments, there are 154 regions and about 2000 hfiles, but the length of compact queue grows to 1391, it cause confusion and may trigger wrong alarm. was: For some system compaction, we set the selectNow to false, so the file selecting will be done until the compaction running, but it brings a side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. An example shows at attachments, there are 154 regions and about 2000 hfiles, but the length of compact queue grows to 1391, it cause confusion and may trigger wrong alarm. > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will not be done until the compaction running, it brings side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger wrong alarm. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HBASE-26987) The length of compact queue grows too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- Summary: The length of compact queue grows too big when the compacting is slow (was: The length of compact queue too big when the compacting is slow) > The length of compact queue grows too big when the compacting is slow > - > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will be done until the compaction running, but it brings a side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger wrong alarm. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HBASE-25768) Support an overall coarse and fast balance strategy for StochasticLoadBalancer
[ https://issues.apache.org/jira/browse/HBASE-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529777#comment-17529777 ] Zheng Wang commented on HBASE-25768: I encountered a similar issue recently: a cluster with 1000+ tables spent several hours balancing when I enabled balanceByTable. I eventually disabled it and set hbase.master.balancer.stochastic.tableSkewCost to 1000 instead, which works well. > Support an overall coarse and fast balance strategy for StochasticLoadBalancer > -- > > Key: HBASE-25768 > URL: https://issues.apache.org/jira/browse/HBASE-25768 > Project: HBase > Issue Type: Improvement > Components: Balancer >Affects Versions: 3.0.0-alpha-1, 2.0.0, 1.4.13 >Reporter: Xiaolin Ha >Assignee: Xiaolin Ha >Priority: Major > > When we use StochasticLoadBalancer + balanceByTable, we could face two > difficulties. > # For each table, their regions are distributed uniformly, but for the > overall cluster, there is still imbalance between RSes; > # When there are large-scale restarts of RSes, or expansion of groups or the > cluster, we hope the balancer can execute as soon as possible, but the > StochasticLoadBalancer may need a lot of time to compute costs. > We can detect these circumstances in the StochasticLoadBalancer (such as by using the > percentage of skewed tables), and before trying the normal balance steps, we > can add a strategy to let it balance like the SimpleLoadBalancer or use > a few light cost functions here. > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
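The workaround described in the comment above, disabling per-table balancing and weighting table skew heavily instead, corresponds to settings like the following hbase-site.xml fragment. This is a sketch of the commenter's configuration; the exact effect of the cost value depends on the HBase version in use:

```xml
<!-- hbase-site.xml: balance the cluster as a whole rather than per table,
     but penalize table-level skew strongly in the StochasticLoadBalancer. -->
<property>
  <name>hbase.master.loadbalance.bytable</name>
  <value>false</value>
</property>
<property>
  <name>hbase.master.balancer.stochastic.tableSkewCost</name>
  <value>1000</value>
</property>
```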
[jira] [Updated] (HBASE-26987) The length of compact queue too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- Description: For some system compaction, we set the selectNow to false, so the file selecting will be done until the compaction running, but it brings a side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. An example shows at attachments, there are 154 regions and about 2000 hfiles, but the length of compact queue grows to 1391, it cause confusion and may trigger wrong alarm. was:For some system compaction, we set the selectNow to false, so the file selecting will be done until the compaction running, but it brings a side effect, if another compacting is slow, we may put lots of compaction to queue, because the filesCompacting of Hstore is empty in the meantime. > The length of compact queue too big when the compacting is slow > --- > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will be done until the compaction running, but it brings a side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. > An example shows at attachments, there are 154 regions and about 2000 hfiles, > but the length of compact queue grows to 1391, it cause confusion and may > trigger wrong alarm. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HBASE-26987) The length of compact queue too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529724#comment-17529724 ] Zheng Wang commented on HBASE-26987: Before this issue is resolved, I think we should have a config to disable this feature. > The length of compact queue too big when the compacting is slow > --- > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compactions, we set selectNow to false, so file > selection is not done until the compaction runs. This brings a side > effect: if another compaction is slow, we may put lots of compactions on the > queue, because the filesCompacting of the HStore is empty in the meantime. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HBASE-26987) The length of compact queue too big when the compacting is slow
[ https://issues.apache.org/jira/browse/HBASE-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26987: --- Summary: The length of compact queue too big when the compacting is slow (was: The length of compact queue will be wrong when the compacting is slow) > The length of compact queue too big when the compacting is slow > --- > > Key: HBASE-26987 > URL: https://issues.apache.org/jira/browse/HBASE-26987 > Project: HBase > Issue Type: Improvement >Reporter: Zheng Wang >Priority: Major > Attachments: image-2022-04-29-10-26-09-351.png, > image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png > > > For some system compaction, we set the selectNow to false, so the file > selecting will be done until the compaction running, but it brings a side > effect, if another compacting is slow, we may put lots of compaction to > queue, because the filesCompacting of Hstore is empty in the meantime. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HBASE-26987) The length of compact queue will be wrong when the compacting is slow
Zheng Wang created HBASE-26987: -- Summary: The length of compact queue will be wrong when the compacting is slow Key: HBASE-26987 URL: https://issues.apache.org/jira/browse/HBASE-26987 Project: HBase Issue Type: Improvement Reporter: Zheng Wang Attachments: image-2022-04-29-10-26-09-351.png, image-2022-04-29-10-26-18-323.png, image-2022-04-29-10-26-24-087.png For some system compactions, we set selectNow to false, so file selection is not done until the compaction runs. This brings a side effect: if another compaction is slow, we may put lots of compactions on the queue, because the filesCompacting of the HStore is empty in the meantime. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
[ https://issues.apache.org/jira/browse/HBASE-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516864#comment-17516864 ] Zheng Wang commented on HBASE-26885: Thanks for the review and push. [~zhangduo] > The TRSP should not go on when it get a bogus server name from AM > - > > Key: HBASE-26885 > URL: https://issues.apache.org/jira/browse/HBASE-26885 > Project: HBase > Issue Type: Improvement > Components: proc-v2 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.12 > > > Currently it will submit lots of unnecessary OpenRegionProcedure by retry. > Related log looks like below, 'localhost,1,1' is the bogus server: > {code:java} > 2022-03-22 10:17:48,301 WARN [PEWorker-8] > assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, > ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region > {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => > 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', > STARTKEY => '002497747', ENDKEY => ''} to server > localhost,1,1, this usually because the server is alread dead, give up and > mark the procedure as complete, the parent procedure will take care of this. 
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; > pid=17952, ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) > 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add > procedure pid=17952, ppid=17951, state=SUCCESS, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th > rollback step {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
[ https://issues.apache.org/jira/browse/HBASE-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514473#comment-17514473 ] Zheng Wang commented on HBASE-26885: Added an addendum PR. We should throw an exception instead of returning directly, to avoid executing too frequently. > The TRSP should not go on when it get a bogus server name from AM > - > > Key: HBASE-26885 > URL: https://issues.apache.org/jira/browse/HBASE-26885 > Project: HBase > Issue Type: Improvement > Components: proc-v2 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.12 > > > Currently it submits lots of unnecessary OpenRegionProcedures by retrying. > The related log looks like the following; 'localhost,1,1' is the bogus server: > {code:java} > 2022-03-22 10:17:48,301 WARN [PEWorker-8] > assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, > ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region > {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => > 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', > STARTKEY => '002497747', ENDKEY => ''} to server > localhost,1,1, this usually because the server is alread dead, give up and > mark the procedure as complete, the parent procedure will take care of this. 
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; > pid=17952, ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) > 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add > procedure pid=17952, ppid=17951, state=SUCCESS, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th > rollback step {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Reopened] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
[ https://issues.apache.org/jira/browse/HBASE-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang reopened HBASE-26885: > The TRSP should not go on when it get a bogus server name from AM > - > > Key: HBASE-26885 > URL: https://issues.apache.org/jira/browse/HBASE-26885 > Project: HBase > Issue Type: Improvement > Components: proc-v2 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.12 > > > Currently it will submit lots of unnecessary OpenRegionProcedure by retry. > Related log looks like below, 'localhost,1,1' is the bogus server: > {code:java} > 2022-03-22 10:17:48,301 WARN [PEWorker-8] > assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, > ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region > {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => > 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', > STARTKEY => '002497747', ENDKEY => ''} to server > localhost,1,1, this usually because the server is alread dead, give up and > mark the procedure as complete, the parent procedure will take care of this. 
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; > pid=17952, ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) > 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add > procedure pid=17952, ppid=17951, state=SUCCESS, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th > rollback step {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
[ https://issues.apache.org/jira/browse/HBASE-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513767#comment-17513767 ] Zheng Wang commented on HBASE-26885: Pushed to 2.4+, thanks for all the comments. [~vjasani] [~zhangduo] [~pankajkumar] > The TRSP should not go on when it get a bogus server name from AM > - > > Key: HBASE-26885 > URL: https://issues.apache.org/jira/browse/HBASE-26885 > Project: HBase > Issue Type: Improvement > Components: proc-v2 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.12 > > > Currently it submits lots of unnecessary OpenRegionProcedures via retry. > The related log looks like the one below; 'localhost,1,1' is the bogus server: > {code:java} > 2022-03-22 10:17:48,301 WARN [PEWorker-8] > assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, > ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region > {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => > 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', > STARTKEY => '002497747', ENDKEY => ''} to server > localhost,1,1, this usually because the server is alread dead, give up and > mark the procedure as complete, the parent procedure will take care of this. 
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; > pid=17952, ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) > 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add > procedure pid=17952, ppid=17951, state=SUCCESS, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th > rollback step {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
[ https://issues.apache.org/jira/browse/HBASE-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang resolved HBASE-26885. Fix Version/s: 2.5.0 2.6.0 3.0.0-alpha-3 2.4.12 Resolution: Fixed > The TRSP should not go on when it get a bogus server name from AM > - > > Key: HBASE-26885 > URL: https://issues.apache.org/jira/browse/HBASE-26885 > Project: HBase > Issue Type: Improvement > Components: proc-v2 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.12 > > > Currently it will submit lots of unnecessary OpenRegionProcedure by retry. > Related log looks like below, 'localhost,1,1' is the bogus server: > {code:java} > 2022-03-22 10:17:48,301 WARN [PEWorker-8] > assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, > ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region > {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => > 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', > STARTKEY => '002497747', ENDKEY => ''} to server > localhost,1,1, this usually because the server is alread dead, give up and > mark the procedure as complete, the parent procedure will take care of this. 
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; > pid=17952, ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) > 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add > procedure pid=17952, ppid=17951, state=SUCCESS, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th > rollback step {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
[ https://issues.apache.org/jira/browse/HBASE-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26885: --- Description: Currently it will submit lots of unnecessary OpenRegionProcedure by retry. Related log looks like below, 'localhost,1,1' is the bogus server: {code:java} 2022-03-22 10:17:48,301 WARN [PEWorker-8] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, ppid=17951, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', STARTKEY => '002497747', ENDKEY => ''} to server localhost,1,1, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this. org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; pid=17952, ppid=17951, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add procedure pid=17952, 
ppid=17951, state=SUCCESS, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th rollback step {code} was: Currently it will submit lots of unnecessary OpenRegionProcedure by retry. Related log looks like below, "localhost,1,1" is the bogus server: {code:java} 2022-03-22 10:17:48,301 WARN [PEWorker-8] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, ppid=17951, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', STARTKEY => '002497747', ENDKEY => ''} to server localhost,1,1, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this. org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; pid=17952, ppid=17951, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add procedure pid=17952, ppid=17951, state=SUCCESS, 
locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th rollback step {code} > The TRSP should not go on when it get a bogus server name from AM > - > > Key: HBASE-26885 > URL: https://issues.apache.org/jira/browse/HBASE-26885 > Project: HBase > Issue Type: Improvement > Components: proc-v2 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Currently it will submit lots of unnecessary OpenRegionProcedure by retry. > Related log looks like below, 'localhost,1,1' is the bogus server: > {code:java} > 2022-03-22
[jira] [Updated] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
[ https://issues.apache.org/jira/browse/HBASE-26885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26885: --- Description: Currently it will submit lots of unnecessary OpenRegionProcedure by retry. Related log looks like below, "localhost,1,1" is the bogus server: {code:java} 2022-03-22 10:17:48,301 WARN [PEWorker-8] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, ppid=17951, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', STARTKEY => '002497747', ENDKEY => ''} to server localhost,1,1, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this. org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; pid=17952, ppid=17951, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add procedure pid=17952, 
ppid=17951, state=SUCCESS, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th rollback step {code} was:Currently it will submit lots of unnecessary OpenRegionProcedure by retry. > The TRSP should not go on when it get a bogus server name from AM > - > > Key: HBASE-26885 > URL: https://issues.apache.org/jira/browse/HBASE-26885 > Project: HBase > Issue Type: Improvement > Components: proc-v2 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Currently it will submit lots of unnecessary OpenRegionProcedure by retry. > Related log looks like below, "localhost,1,1" is the bogus server: > {code:java} > 2022-03-22 10:17:48,301 WARN [PEWorker-8] > assignment.RegionRemoteProcedureBase: Can not add remote operation pid=17952, > ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region > {ENCODED => 490391c232c7aa13f7e0d50bfe1f7235, NAME => > 'TestTable1,002497747,1647568640784.490391c232c7aa13f7e0d50bfe1f7235.', > STARTKEY => '002497747', ENDKEY => ''} to server > localhost,1,1, this usually because the server is alread dead, give up and > mark the procedure as complete, the parent procedure will take care of this. 
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: localhost,1,1; > pid=17952, ppid=17951, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:168) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:285) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965) > 2022-03-22 10:17:48,301 DEBUG [PEWorker-8] procedure2.RootProcedureState: Add > procedure pid=17952, ppid=17951, state=SUCCESS, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 8th > rollback step {code} -- This message was sent
[jira] [Comment Edited] (HBASE-26884) Find unavailable regions by the startcode checking on hmaster start up and reassign them
[ https://issues.apache.org/jira/browse/HBASE-26884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513293#comment-17513293 ] Zheng Wang edited comment on HBASE-26884 at 3/28/22, 10:32 AM: --- Seen this in 2.0 (CDH 6.0.1, about a year ago) and in 2.2.0 recently; not sure it could happen without user misoperation. BTW, all those cases came along with a server crash. [~anoop.hbase] was (Author: filtertip): Seen this in 2.0 (CDH 6.0.1, about a year ago) and in 2.2.0 recently; not sure it could happen without user misoperation. [~anoop.hbase] > Find unavailable regions by the startcode checking on hmaster start up and > reassign them > > > Key: HBASE-26884 > URL: https://issues.apache.org/jira/browse/HBASE-26884 > Project: HBase > Issue Type: Improvement > Components: master >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Sometimes we have seen regions in open or opening state that are not deployed > on any rs and have no procedures for them, and after checking the meta table > we find their startcodes are expired. > It is not easy to reproduce; it may be caused by a corner-case bug or user misoperation. > My approach is to add a check on hmaster start up: if the startcode of the > regionLocation has expired, and there is neither a TRSP on the region nor an SCP on the > regionserver, then we should reassign the region, so we can resolve it easily just by > restarting hmaster. > Hbck2 may also be useful for some of these cases, but it is not easy for a common > user to use, especially when the number of such regions is not small and they need > to be recovered quickly. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HBASE-26884) Find unavailable regions by the startcode checking on hmaster start up and reassign them
[ https://issues.apache.org/jira/browse/HBASE-26884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26884: --- Description: Sometimes we have seen regions in open or opening state that are not deployed on any rs and have no procedures for them, and after checking the meta table we find their startcodes are expired. It is not easy to reproduce; it may be caused by a corner-case bug or user misoperation. My approach is to add a check on hmaster start up: if the startcode of the regionLocation has expired, and there is neither a TRSP on the region nor an SCP on the regionserver, then we should reassign the region, so we can resolve it easily just by restarting hmaster. Hbck2 may also be useful for some of these cases, but it is not easy for a common user to use, especially when the number of such regions is not small and they need to be recovered quickly. was: Sometimes we have seen there are regions in open or opening state, but does not deployed on any rs and without procs for them, and afting checking the meta table, we find these startcode are expired. It is no easy to reproduce, may be caused by corner bug or user misoperation. My approach is add some checking on hmaster start up, if the startcode of the regionLocation expired, and neither TRSP on region nor SCP on regionserver, then we should reassign the region, then we can resovle it easily just by restart hmaster. Hbck2 maybe also useful for some of them cases, but not easily for common user to use, especially the number of these regions not small and need to be recovery quickly. 
> Find unavailable regions by the startcode checking on hmaster start up and > reassign them > > > Key: HBASE-26884 > URL: https://issues.apache.org/jira/browse/HBASE-26884 > Project: HBase > Issue Type: Improvement > Components: master >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Sometimes we have seen regions in open or opening state that are not deployed > on any rs and have no procedures for them, and after checking the meta table > we find their startcodes are expired. > It is not easy to reproduce; it may be caused by a corner-case bug or user misoperation. > My approach is to add a check on hmaster start up: if the startcode of the > regionLocation has expired, and there is neither a TRSP on the region nor an SCP on the > regionserver, then we should reassign the region, so we can resolve it easily just by > restarting hmaster. > Hbck2 may also be useful for some of these cases, but it is not easy for a common > user to use, especially when the number of such regions is not small and they need > to be recovered quickly. -- This message was sent by Atlassian Jira (v8.20.1#820001)
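The proposed startup check can be sketched at the language level (all class and method names here are illustrative assumptions, not HBase's actual classes): a region is flagged for reassignment only when its recorded startcode no longer matches a live regionserver and no TRSP/SCP is already handling it.

```java
import java.util.*;

public class StartupRegionCheck {
    static class RegionLoc {
        final String region; final String server; final long startcode;
        RegionLoc(String region, String server, long startcode) {
            this.region = region; this.server = server; this.startcode = startcode;
        }
    }

    // liveServers maps host -> current startcode of the live regionserver on it.
    static List<String> regionsToReassign(List<RegionLoc> metaLocations,
                                          Map<String, Long> liveServers,
                                          Set<String> regionsWithTrsp,
                                          Set<String> serversWithScp) {
        List<String> out = new ArrayList<>();
        for (RegionLoc loc : metaLocations) {
            Long live = liveServers.get(loc.server);
            // Startcode expired: the host is gone, or it restarted with a new startcode.
            boolean startcodeExpired = live == null || live.longValue() != loc.startcode;
            // If a TRSP or SCP already covers this region/server, leave it alone.
            boolean hasProc = regionsWithTrsp.contains(loc.region)
                           || serversWithScp.contains(loc.server);
            if (startcodeExpired && !hasProc) {
                out.add(loc.region); // nothing else will fix it; schedule a reassign
            }
        }
        return out;
    }
}
```

Under this sketch, an operator would only need to restart the hmaster to have such stranded regions picked up and reassigned.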
[jira] [Commented] (HBASE-26884) Find unavailable regions by the startcode checking on hmaster start up and reassign them
[ https://issues.apache.org/jira/browse/HBASE-26884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513293#comment-17513293 ] Zheng Wang commented on HBASE-26884: Seen this in 2.0 (CDH 6.0.1, about a year ago) and in 2.2.0 recently; not sure it could happen without user misoperation. [~anoop.hbase] > Find unavailable regions by the startcode checking on hmaster start up and > reassign them > > > Key: HBASE-26884 > URL: https://issues.apache.org/jira/browse/HBASE-26884 > Project: HBase > Issue Type: Improvement > Components: master >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > Sometimes we have seen regions in open or opening state that are not deployed > on any rs and have no procedures for them, and after checking the meta table > we find their startcodes are expired. > It is not easy to reproduce; it may be caused by a corner-case bug or user misoperation. > My approach is to add a check on hmaster start up: if the startcode of the > regionLocation has expired, and there is neither a TRSP on the region nor an SCP on the > regionserver, then we should reassign the region, so we can resolve it easily just by > restarting hmaster. > Hbck2 may also be useful for some of these cases, but it is not easy for a common > user to use, especially when the number of such regions is not small and they need > to be recovered quickly. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26885) The TRSP should not go on when it get a bogus server name from AM
Zheng Wang created HBASE-26885: -- Summary: The TRSP should not go on when it get a bogus server name from AM Key: HBASE-26885 URL: https://issues.apache.org/jira/browse/HBASE-26885 Project: HBase Issue Type: Improvement Components: proc-v2 Reporter: Zheng Wang Assignee: Zheng Wang Currently it submits lots of unnecessary OpenRegionProcedures via retry. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HBASE-26884) Find unavailable regions by the startcode checking on hmaster start up and reassign them
Zheng Wang created HBASE-26884: -- Summary: Find unavailable regions by the startcode checking on hmaster start up and reassign them Key: HBASE-26884 URL: https://issues.apache.org/jira/browse/HBASE-26884 Project: HBase Issue Type: Improvement Components: master Reporter: Zheng Wang Assignee: Zheng Wang Sometimes we have seen regions in open or opening state that are not deployed on any rs and have no procedures for them, and after checking the meta table we find their startcodes are expired. It is not easy to reproduce; it may be caused by a corner-case bug or user misoperation. My approach is to add a check on hmaster start up: if the startcode of the regionLocation has expired, and there is neither a TRSP on the region nor an SCP on the regionserver, then we should reassign the region, so we can resolve it easily just by restarting hmaster. Hbck2 may also be useful for some of these cases, but it is not easy for a common user to use, especially when the number of such regions is not small and they need to be recovered quickly. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang resolved HBASE-26027. Resolution: Fixed > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.5.0, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.6.0 > > > The batch API of HTable takes a param named results to store each result or > exception; its type is Object[]. > If the user passes an array of a narrower type, e.g. > org.apache.hadoop.hbase.client.Result[], and we need to put an exception > into it for some reason, then an ArrayStoreException occurs in > AsyncRequestFutureImpl.updateResult, AsyncRequestFutureImpl.decActionCounter > is skipped, and AsyncRequestFutureImpl.waitUntilDone gets stuck checking > actionsInProgress again and again, forever. > It is better to add a cutoff calculated from operationTimeout, instead of > depending only on the value of actionsInProgress. > BTW, this issue only affects 2.x; in 3.x the implementation has been refactored. 
> How to reproduce: > 1: add a sleep in RSRpcServices.multi to mock a slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set timeouts in the config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call the batch API > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List<Get> gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will look like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to > finish on table: test > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
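The root cause of the failure described above is Java's covariant arrays: a Result[] is assignable to an Object[] parameter, but storing anything that is not a Result into it throws at runtime. A minimal standalone demonstration, with no HBase dependency (String[] stands in for Result[]):

```java
public class ArrayStoreDemo {
    public static void main(String[] args) {
        String[] results = new String[2]; // stands in for Result[]
        Object[] slots = results;         // legal: Java arrays are covariant
        slots[0] = "ok";                  // fine: a String fits the real type
        try {
            // The "store an exception into the results array" step fails,
            // because the array's runtime component type is String.
            slots[1] = new RuntimeException("boom");
        } catch (ArrayStoreException e) {
            // In the issue, this exception skips the counter decrement that
            // waitUntilDone relies on, so the waiter never wakes up.
            System.out.println("ArrayStoreException caught");
        }
    }
}
```

This is why passing a typed array like `new Result[gets.size()]` to `batch` triggers the hang as soon as any action fails.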
[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456906#comment-17456906 ] Zheng Wang commented on HBASE-26027: Merged to branch-2, thanks for the review. [~apurtell] > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.5.0, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.6.0 > > > The batch API of HTable takes a param named results to store each result or > exception; its type is Object[]. > If the user passes an array of a narrower type, e.g. > org.apache.hadoop.hbase.client.Result[], and we need to put an exception > into it for some reason, then an ArrayStoreException occurs in > AsyncRequestFutureImpl.updateResult, AsyncRequestFutureImpl.decActionCounter > is skipped, and AsyncRequestFutureImpl.waitUntilDone gets stuck checking > actionsInProgress again and again, forever. > It is better to add a cutoff calculated from operationTimeout, instead of > depending only on the value of actionsInProgress. > BTW, this issue only affects 2.x; in 3.x the implementation has been refactored. 
> How to reproduce: > 1: add a sleep in RSRpcServices.multi to mock a slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set timeouts in the config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call the batch API > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List<Get> gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will look like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to > finish on table: test > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
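The proposed fix — a cutoff derived from operationTimeout rather than waiting on actionsInProgress alone — can be sketched as a deadline-bounded wait loop. The names and the polling here are illustrative assumptions, not the actual AsyncRequestFutureImpl code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class BoundedWait {
    // Returns true if all actions finished, false if the cutoff fired first.
    static boolean waitUntilDone(AtomicLong actionsInProgress, long operationTimeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + operationTimeoutMs;
        while (actionsInProgress.get() > 0) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // give up instead of looping forever on a stale counter
            }
            Thread.sleep(10); // real code would park on a condition, not busy-poll
        }
        return true;
    }
}
```

Even if a bug like the ArrayStoreException leaves the counter stuck above zero, the caller gets a timeout after operationTimeout instead of hanging indefinitely.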
[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26027: --- Fix Version/s: 2.6.0 > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.5.0, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.6.0 > > > The batch API of HTable takes a param named results to store each result or > exception; its type is Object[]. > If the user passes an array of a narrower type, e.g. > org.apache.hadoop.hbase.client.Result[], and we need to put an exception > into it for some reason, then an ArrayStoreException occurs in > AsyncRequestFutureImpl.updateResult, AsyncRequestFutureImpl.decActionCounter > is skipped, and AsyncRequestFutureImpl.waitUntilDone gets stuck checking > actionsInProgress again and again, forever. > It is better to add a cutoff calculated from operationTimeout, instead of > depending only on the value of actionsInProgress. > BTW, this issue only affects 2.x; in 3.x the implementation has been refactored. 
> How to reproduce: > 1: add sleep in RSRpcServices.multi to mock a slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set timeouts in the config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call the batch API > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List<Get> gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will look like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to > finish on table: test > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
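The root cause above is ordinary Java array covariance: a Result[] is assignable to the batch API's Object[] parameter, but storing anything that is not a Result into it fails at runtime. A minimal, self-contained sketch (plain Java, not HBase code; String[] stands in for Result[] and the RuntimeException for the DoNotRetryIOException that updateResult tries to store):

```java
// Minimal sketch of the array-covariance pitfall behind this hang.
public class ArrayStoreDemo {
    public static void main(String[] args) {
        String[] typed = new String[2];   // caller allocates, e.g. new Result[n]
        Object[] results = typed;         // the batch-style API sees only Object[]
        results[0] = "a result";          // fine: a String fits in a String[]
        boolean stored;
        try {
            // Storing an exception object, as updateResult() does on failure:
            results[1] = new RuntimeException("boom"); // not a String
            stored = true;
        } catch (ArrayStoreException e) {
            // Thrown before the action counter is decremented, so the
            // waitUntilDone() loop never sees the action complete.
            stored = false;
        }
        System.out.println("stored=" + stored); // prints stored=false
    }
}
```

Passing Object[] (or Result[] for read-only batches that never store exceptions) avoids the store failure, but the client-side hang only goes away once the wait loop has a timeout.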
[jira] [Comment Edited] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456137#comment-17456137 ] Zheng Wang edited comment on HBASE-26027 at 12/9/21, 4:33 AM: -- Pushed a new PR #3925, just for branch-2. [~apurtell] was (Author: filtertip): Pushed a new PR #3925, just for branch-2. And not plan to push 2.3.x and 2.4.x since they are already stable. [~apurtell]
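The remedy proposed in the issue description ("add a cutoff calculated from operationTimeout") amounts to bounding the wait loop by a deadline instead of trusting actionsInProgress alone. The following is a hypothetical illustration of that idea, not the actual PR #3925 patch; the method and variable names are modeled loosely after those in the report:

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: a wait loop with a hard cutoff, so that a lost
// decActionCounter() can no longer block the caller forever.
public class BoundedWaitDemo {
    static void waitUntilDone(AtomicLong actionsInProgress, long operationTimeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + operationTimeoutMs;
        synchronized (actionsInProgress) {
            while (actionsInProgress.get() > 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    // Without this cutoff the loop spins forever when the
                    // counter is never decremented (the bug in this issue).
                    throw new TimeoutException(
                        actionsInProgress.get() + " actions still in progress");
                }
                // Wait in short slices so a missed notify cannot stall us past
                // the deadline.
                actionsInProgress.wait(Math.min(remaining, 100));
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong counter = new AtomicLong(10); // never decremented, as in the bug
        try {
            waitUntilDone(counter, 200);
            System.out.println("done");
        } catch (TimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

With the counter stuck at 10, the call returns after roughly operationTimeoutMs with a TimeoutException rather than hanging, matching the behavior the issue asks for.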
[jira] [Comment Edited] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456137#comment-17456137 ] Zheng Wang edited comment on HBASE-26027 at 12/9/21, 4:29 AM: -- Pushed a new PR #3925, just for branch-2. And not plan to push 2.3.x and 2.4.x since they are already stable. [~apurtell] was (Author: filtertip): Pushed a new PR #3925, just for branch-2. And not plan to push 2.3.x and 2.4.x since they are already stable.
[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456137#comment-17456137 ] Zheng Wang commented on HBASE-26027: Pushed a new PR #3925, just for branch-2. And not plan to push 2.3.x and 2.4.x since they are already stable.
[jira] [Work started] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-26027 started by Zheng Wang. -- > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.3.8, 2.4.9
[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384823#comment-17384823 ] Zheng Wang commented on HBASE-26027: Will dig into it later, thanks for the finding. [~apurtell] > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.3.6, 2.4.6
[jira] [Commented] (HBASE-26036) DBB released too early in HRegion.get() and dirty data for some operations
[ https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373233#comment-17373233 ] Zheng Wang commented on HBASE-26036: [~Xiaolin Ha]. Ok, get it. > DBB released too early in HRegion.get() and dirty data for some operations > -- > > Key: HBASE-26036 > URL: https://issues.apache.org/jira/browse/HBASE-26036 > Project: HBase > Issue Type: Bug > Components: rpc >Affects Versions: 3.0.0-alpha-1, 2.0.0 >Reporter: Xiaolin Ha >Assignee: Xiaolin Ha >Priority: Critical > > Before HBASE-25187, we observed regionserver JVM crashes on our production > clusters; the coredump info is as follows: > {code:java} > Stack: [0x7f621ba8d000,0x7f621bb8e000], sp=0x7f621bb8c0e0, free > space=1020k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native > code) > J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9 > bytes) @ 0x7f6a5ee11b2d [0x7f6a5ee11ae0+0x4d] > J 22844 C2 > org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z > (540 bytes) @ 0x7f6a60bed144 [0x7f6a60beb320+0x1e24] > J 17972 C2 > org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z > (312 bytes) @ 0x7f6a5f4a7ed0 [0x7f6a5f4a6f40+0xf90] > J 26197 C2 > 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse; > (644 bytes) @ 0x7f6a61538b0c [0x7f6a61537940+0x11cc] > J 26332 C2 > org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair; > (566 bytes) @ 0x7f6a615e8228 [0x7f6a615e79c0+0x868] > J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @ > 0x7f6a60711a4c [0x7f6a60711000+0xa4c] > J 19656% C2 > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V > (338 bytes) @ 0x7f6a6039a414 [0x7f6a6039a320+0xf4] > j org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24 > j java.lang.Thread.run()V+11 > v ~StubRoutines::call_stub > {code} > I have made a UT that reproduces this error; it occurs 100% of the time. > After HBASE-25187, the check result of the checkAndMutate will be false, > because it reads wrong/dirty data from the released ByteBuff. -- This message was sent by Atlassian Jira (v8.3.4#803005)
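The "wrong/dirty data from the released ByteBuff" failure described above can be modeled with a toy reference-counted buffer whose memory is recycled on release. This is a simplified illustration of the use-after-release pattern, not HBase's actual ByteBuff API; the class and field names are invented for the sketch:

```java
// Toy model of a reference-counted buffer. Reading through a retained view
// after release() observes recycled ("dirty") bytes; with real off-heap
// memory the same access pattern can crash the JVM, as in the coredumps above.
public class DirtyReadDemo {
    static class RefCountedBuffer {
        byte[] data = {42};
        int refCnt = 1;
        byte read() { return data[0]; }
        void release() {
            if (--refCnt == 0) {
                data[0] = 0; // simulate the allocator recycling the memory
            }
        }
    }

    public static void main(String[] args) {
        RefCountedBuffer buf = new RefCountedBuffer();
        byte before = buf.read();   // 42: the value a checkAndMutate-style check expects
        buf.release();              // released too early, as in HRegion.get()
        byte after = buf.read();    // 0: dirty data, so the check wrongly fails
        System.out.println(before + " -> " + after); // prints 42 -> 0
    }
}
```

The fix direction for such bugs is to defer release until every reader of the buffer (including cells handed to the RPC layer) is done with it.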
[jira] [Comment Edited] (HBASE-26036) DBB released too early in HRegion.get() and dirty data for some operations
[ https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373145#comment-17373145 ] Zheng Wang edited comment on HBASE-26036 at 7/2/21, 2:01 AM: - HBASE-25187 seems not related to this issue, should be HBASE-25981? [~Xiaolin Ha] was (Author: filtertip): HBASE-25187 seems not related to this issue, should be HBASE-25981?
[jira] [Commented] (HBASE-26036) DBB released too early in HRegion.get() and dirty data for some operations
[ https://issues.apache.org/jira/browse/HBASE-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373145#comment-17373145 ] Zheng Wang commented on HBASE-26036: HBASE-25187 seems not related to this issue, should be HBASE-25981? > DBB released too early in HRegion.get() and dirty data for some operations > -- > > Key: HBASE-26036 > URL: https://issues.apache.org/jira/browse/HBASE-26036 > Project: HBase > Issue Type: Bug > Components: rpc >Affects Versions: 3.0.0-alpha-1, 2.0.0 >Reporter: Xiaolin Ha >Assignee: Xiaolin Ha >Priority: Critical > > Before HBASE-25187, we found there are regionserver JVM crashing problems on > our production clusters, the coredump infos are as follows, > {code:java} > Stack: [0x7f621ba8d000,0x7f621bb8e000], sp=0x7f621bb8c0e0, free > space=1020k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native > code) > J 10829 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.getTimestamp()J (9 > bytes) @ 0x7f6a5ee11b2d [0x7f6a5ee11ae0+0x4d] > J 22844 C2 > org.apache.hadoop.hbase.regionserver.HRegion.doCheckAndRowMutate([B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/client/RowMutations;Lorg/apache/hadoop/hbase/client/Mutation;Z)Z > (540 bytes) @ 0x7f6a60bed144 [0x7f6a60beb320+0x1e24] > J 17972 C2 > org.apache.hadoop.hbase.regionserver.RSRpcServices.checkAndRowMutate(Lorg/apache/hadoop/hbase/regionserver/Region;Ljava/util/List;Lorg/apache/hadoop/hbase/CellScanner;[B[B[BLorg/apache/hadoop/hbase/filter/CompareFilter$CompareOp;Lorg/apache/hadoop/hbase/filter/ByteArrayComparable;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$RegionActionResult$Builder;)Z > (312 bytes) @ 0x7f6a5f4a7ed0 [0x7f6a5f4a6f40+0xf90] > J 26197 C2 > 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(Lorg/apache/hbase/thirdparty/com/google/protobuf/RpcController;Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiRequest;)Lorg/apache/hadoop/hbase/shaded/protobuf/generated/ClientProtos$MultiResponse; > (644 bytes) @ 0x7f6a61538b0c [0x7f6a61537940+0x11cc] > J 26332 C2 > org.apache.hadoop.hbase.ipc.RpcServer.call(Lorg/apache/hadoop/hbase/ipc/RpcCall;Lorg/apache/hadoop/hbase/monitoring/MonitoredRPCHandler;)Lorg/apache/hadoop/hbase/util/Pair; > (566 bytes) @ 0x7f6a615e8228 [0x7f6a615e79c0+0x868] > J 20563 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1196 bytes) @ > 0x7f6a60711a4c [0x7f6a60711000+0xa4c] > J 19656% C2 > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V > (338 bytes) @ 0x7f6a6039a414 [0x7f6a6039a320+0xf4] > j org.apache.hadoop.hbase.ipc.RpcExecutor$1.run()V+24 > j java.lang.Thread.run()V+11 > v ~StubRoutines::call_stub > {code} > I have made a UT to reproduce this error; it occurs 100% of the time. > After HBASE-25187, the check result of the checkAndMutate will be false, > because it reads wrong/dirty data from the released ByteBuff. -- This message was sent by Atlassian Jira (v8.3.4#803005)
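The crash and the dirty reads described above both come from touching a pooled, reference-counted buffer after it has been handed back to the pool. A minimal sketch of that hazard, with a hypothetical RefCountedBuffer class standing in for HBase's internal ByteBuff/RefCnt machinery (the real classes differ; this only illustrates the failure mode):

```java
// Sketch of the use-after-release hazard behind HBASE-26036.
// RefCountedBuffer is a hypothetical stand-in, not HBase's ByteBuff API.
public class ReleaseDemo {
    static class RefCountedBuffer {
        private byte[] data;
        private int refCnt = 1;

        RefCountedBuffer(byte[] data) { this.data = data; }

        byte get(int i) {
            if (refCnt <= 0) {
                // A real pooled direct buffer would instead hand back reused
                // (dirty) memory, or crash the JVM if the memory was unmapped.
                throw new IllegalStateException("read after release");
            }
            return data[i];
        }

        void release() {
            if (--refCnt == 0) {
                data = null; // returned to the pool; contents may be overwritten
            }
        }
    }

    // Returns true when the read is rejected because release() already ran.
    static boolean readAfterRelease() {
        RefCountedBuffer buf = new RefCountedBuffer(new byte[] {1, 2, 3});
        buf.release(); // released too early, as in the HRegion.get() path
        try {
            buf.get(0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterRelease()); // true
    }
}
```

The guard in get() makes the bug loud; without such a guard the read silently returns whatever now occupies the recycled memory, which is exactly the wrong/dirty checkAndMutate result the issue describes.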
[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372663#comment-17372663 ] Zheng Wang commented on HBASE-26027: Thanks for all the comments. [~reidchan] [~zhangduo] > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.3.6, 2.4.5 > > > The batch API of HTable takes a param named results to store results or > exceptions; its type is Object[]. > If the user passes an array of another type, e.g. > org.apache.hadoop.hbase.client.Result, and we need to put an exception > into it for some reason, an ArrayStoreException occurs in > AsyncRequestFutureImpl.updateResult, AsyncRequestFutureImpl.decActionCounter > is skipped, and AsyncRequestFutureImpl.waitUntilDone gets stuck checking > actionsInProgress again and again, forever. > It is better to add a cutoff calculated from operationTimeout instead of > depending only on the value of actionsInProgress. > BTW, this issue only affects 2.x; in 3.x the implementation has been refactored. 
> How to reproduce: > 1: add sleep in RSRpcServices.multi to mock slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set timeouts in config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call batch api > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List<Get> gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will look like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to > finish on table: test > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
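The ArrayStoreException at the heart of this issue comes from Java's array covariance: an Object[] parameter can reference a Result[] at runtime, and storing anything that is not a Result into it then fails. A small self-contained sketch (Result here is a local stand-in, not the real org.apache.hadoop.hbase.client.Result, and storeExceptionInto only mirrors the store that AsyncRequestFutureImpl.updateResult attempts):

```java
// Why HTable.batch's Object[] results param can throw at runtime:
// arrays are covariant, so a Result[] is assignable to Object[],
// but the element type is still checked on every store.
public class ArrayStoreDemo {
    static class Result {}  // stand-in for the HBase client Result class

    // Mirrors recording an exception into the caller-supplied results array.
    static boolean storeExceptionInto(Object[] results) {
        try {
            results[0] = new RuntimeException("server side failure");
            return true;
        } catch (ArrayStoreException e) {
            // In the client, this escaping exception meant decActionCounter
            // never ran, so waitUntilDone spun on actionsInProgress forever.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(storeExceptionInto(new Object[1])); // true
        System.out.println(storeExceptionInto(new Result[1])); // false: covariance bites
    }
}
```

This is why the reproduction passes a Result[] to batch: the array type itself is what turns a recoverable server error into a permanently stuck client.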
[jira] [Resolved] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang resolved HBASE-26027. Resolution: Fixed > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.3.6, 2.4.5 > > > The batch api of HTable contains a param named results to store result or > exception, its type is Object[]. > If user pass an array with other type, eg: > org.apache.hadoop.hbase.client.Result, and if we need to put an exception > into it by some reason, then the ArrayStoreException will occur in > AsyncRequestFutureImpl.updateResult, then the > AsyncRequestFutureImpl.decActionCounter will be skipped, then in the > AsyncRequestFutureImpl.waitUntilDone we will stuck at here checking the > actionsInProgress again and again, forever. > It is better to add an cutoff calculated by operationTimeout, instead of only > depend on the value of actionsInProgress. > BTW, this issue only for 2.x, since 3.x the implement has refactored. 
> How to reproduce: > 1: add sleep in RSRpcServices.multi to mock slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set time out in config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call batch api > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will looks like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to > finish on table: test > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26027: --- Fix Version/s: 2.4.5 2.3.6 2.5.0 > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.3.6, 2.4.5 > > > The batch api of HTable contains a param named results to store result or > exception, its type is Object[]. > If user pass an array with other type, eg: > org.apache.hadoop.hbase.client.Result, and if we need to put an exception > into it by some reason, then the ArrayStoreException will occur in > AsyncRequestFutureImpl.updateResult, then the > AsyncRequestFutureImpl.decActionCounter will be skipped, then in the > AsyncRequestFutureImpl.waitUntilDone we will stuck at here checking the > actionsInProgress again and again, forever. > It is better to add an cutoff calculated by operationTimeout, instead of only > depend on the value of actionsInProgress. > BTW, this issue only for 2.x, since 3.x the implement has refactored. 
> How to reproduce: > 1: add sleep in RSRpcServices.multi to mock slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set time out in config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call batch api > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will looks like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to > finish on table: test > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26028) The view as json page shows exception when using TinyLfuBlockCache
[ https://issues.apache.org/jira/browse/HBASE-26028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26028: --- Fix Version/s: 2.4.5 2.3.6 2.5.0 > The view as json page shows exception when using TinyLfuBlockCache > -- > > Key: HBASE-26028 > URL: https://issues.apache.org/jira/browse/HBASE-26028 > Project: HBase > Issue Type: Bug > Components: UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 2.5.0, 2.3.6, 3.0.0-alpha-2, 2.4.5 > > Attachments: HBASE-26028-afterpatch.jpg, HBASE-26028-beforepatch.jpg > > > Some variable in TinyLfuBlockCache should be marked as transient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-26028) The view as json page shows exception when using TinyLfuBlockCache
[ https://issues.apache.org/jira/browse/HBASE-26028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26028: --- Fix Version/s: 3.0.0-alpha-2 > The view as json page shows exception when using TinyLfuBlockCache > -- > > Key: HBASE-26028 > URL: https://issues.apache.org/jira/browse/HBASE-26028 > Project: HBase > Issue Type: Bug > Components: UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Fix For: 3.0.0-alpha-2 > > Attachments: HBASE-26028-afterpatch.jpg, HBASE-26028-beforepatch.jpg > > > Some variable in TinyLfuBlockCache should be marked as transient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-26028) The view as json page shows exception when using TinyLfuBlockCache
[ https://issues.apache.org/jira/browse/HBASE-26028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang resolved HBASE-26028. Resolution: Fixed > The view as json page shows exception when using TinyLfuBlockCache > -- > > Key: HBASE-26028 > URL: https://issues.apache.org/jira/browse/HBASE-26028 > Project: HBase > Issue Type: Bug > Components: UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: HBASE-26028-afterpatch.jpg, HBASE-26028-beforepatch.jpg > > > Some variable in TinyLfuBlockCache should be marked as transient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-26028) The view as json page shows exception when using TinyLfuBlockCache
[ https://issues.apache.org/jira/browse/HBASE-26028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372313#comment-17372313 ] Zheng Wang commented on HBASE-26028: Thanks for the reviewing. [~vjasani] > The view as json page shows exception when using TinyLfuBlockCache > -- > > Key: HBASE-26028 > URL: https://issues.apache.org/jira/browse/HBASE-26028 > Project: HBase > Issue Type: Bug > Components: UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: HBASE-26028-afterpatch.jpg, HBASE-26028-beforepatch.jpg > > > Some variable in TinyLfuBlockCache should be marked as transient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
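As a rough illustration of why marking a field transient keeps it out of a serialized view: transient fields are skipped by Java's built-in serialization and come back as defaults after a round trip. This sketch uses java.io serialization purely for illustration; Cache is a hypothetical stand-in, not the real TinyLfuBlockCache, and the actual JSON dump path in HBase may differ:

```java
import java.io.*;

// Sketch of the transient fix pattern for HBASE-26028 (hypothetical Cache class).
public class TransientDemo {
    static class Cache implements Serializable {
        long hitCount = 42;                              // serialized normally
        transient Thread evictionThread = new Thread();  // skipped: Thread is not serializable
    }

    // Serialize and deserialize in memory; would throw NotSerializableException
    // on evictionThread if it were not marked transient.
    static Cache roundTrip(Cache c) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(c);
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        return (Cache) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        Cache copy = roundTrip(new Cache());
        System.out.println(copy.hitCount);               // 42
        System.out.println(copy.evictionThread == null); // true: transient field dropped
    }
}
```

The serializable counters survive the round trip while the non-serializable internals are dropped, which is the effect the fix relies on for the "view as json" page.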
[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26027: --- Description: The batch api of HTable contains a param named results to store result or exception, its type is Object[]. If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and if we need to put an exception into it by some reason, then the ArrayStoreException will occur in AsyncRequestFutureImpl.updateResult, then the AsyncRequestFutureImpl.decActionCounter will be skipped, then in the AsyncRequestFutureImpl.waitUntilDone we will stuck at here checking the actionsInProgress again and again, forever. It is better to add an cutoff calculated by operationTimeout, instead of only depend on the value of actionsInProgress. BTW, this issue only for 2.x, since 3.x the implement has refactored. How to reproduce: 1: add sleep in RSRpcServices.multi to mock slow response {code:java} try { Thread.sleep(2000); } catch (InterruptedException e) { e.printStackTrace(); } {code} 2: set time out in config {code:java} conf.set("hbase.rpc.timeout","2000"); conf.set("hbase.client.operation.timeout","6000"); {code} 3: call batch api {code:java} Table table = HbaseUtil.getTable("test"); byte[] cf = Bytes.toBytes("f"); byte[] c = Bytes.toBytes("c1"); List gets = new ArrayList<>(); for (int i = 0; i < 10; i++) { byte[] rk = Bytes.toBytes("rk-" + i); Get get = new Get(rk); get.addColumn(cf, c); gets.add(get); } Result[] results = new Result[gets.size()]; table.batch(gets, results); {code} The log will looks like below: {code:java} [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - id=1 error for test processing localhost,16020,1624343786295 java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) at 
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) at java.util.concurrent.FutureTask.run(FutureTask.java) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to finish on table: [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to finish on table: test {code} was: The batch api of HTable contains a param named results to store result or exception, its type is Object[]. 
If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason, then the ArrayStoreException will occur in AsyncRequestFutureImpl.updateResult, then the AsyncRequestFutureImpl.decActionCounter will be skipped, then in the AsyncRequestFutureImpl.waitUntilDone we will stuck at here checking the actionsInProgress again and again, forever. It is better to add an cutoff calculated by operationTimeout, instead of only depend on the value of actionsInProgress. BTW, this issue only for 2.x, since 3.x the implement has refactored. How to reproduce: 1: add sleep in RSRpcServices.multi to mock slow response {code:java} try { Thread.sleep(2000); } catch (InterruptedException e) { e.printStackTrace(); } {code} 2: set time out in config {code:java} conf.set("hbase.rpc.timeout","2000"); conf.set("hbase.client.operation.timeout","6000"); {code} 3: call
[jira] [Comment Edited] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369449#comment-17369449 ] Zheng Wang edited comment on HBASE-26027 at 6/25/21, 1:00 PM: -- {quote}Don't really get this: If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason how to reproduce it? {quote} {color:#172b4d}Added in description, thanks for the comment. [~reidchan] {color} was (Author: filtertip): {quote}Don't really get this: If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason how to reproduce it? {quote} {color:#172b4d}Added in description, thanks for the comment.{color} > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > The batch api of HTable contains a param named results to store result or > exception, its type is Object[]. > If user pass an array with other type, eg: > org.apache.hadoop.hbase.client.Result, and we need to put an exception into > it by some reason, then the ArrayStoreException will occur in > AsyncRequestFutureImpl.updateResult, then the > AsyncRequestFutureImpl.decActionCounter will be skipped, then in the > AsyncRequestFutureImpl.waitUntilDone we will stuck at here checking the > actionsInProgress again and again, forever. > It is better to add an cutoff calculated by operationTimeout, instead of only > depend on the value of actionsInProgress. > BTW, this issue only for 2.x, since 3.x the implement has refactored. 
> How to reproduce: > 1: add sleep in RSRpcServices.multi to mock slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set time out in config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call batch api > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will looks like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ]
[jira] [Comment Edited] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369449#comment-17369449 ] Zheng Wang edited comment on HBASE-26027 at 6/25/21, 1:00 PM: -- {quote}Don't really get this: If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason how to reproduce it? {quote} {color:#172b4d}Added in description, thanks for the comment.{color} was (Author: filtertip): {quote}Don't really get this: {quote}If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason {quote} how to reproduce it? {color:#172b4d}Added in description, thanks for the comment.{color} {quote} > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > The batch api of HTable contains a param named results to store result or > exception, its type is Object[]. > If user pass an array with other type, eg: > org.apache.hadoop.hbase.client.Result, and we need to put an exception into > it by some reason, then the ArrayStoreException will occur in > AsyncRequestFutureImpl.updateResult, then the > AsyncRequestFutureImpl.decActionCounter will be skipped, then in the > AsyncRequestFutureImpl.waitUntilDone we will stuck at here checking the > actionsInProgress again and again, forever. > It is better to add an cutoff calculated by operationTimeout, instead of only > depend on the value of actionsInProgress. > BTW, this issue only for 2.x, since 3.x the implement has refactored. 
> How to reproduce: > 1: add sleep in RSRpcServices.multi to mock slow response > {code:java} > try { > Thread.sleep(2000); > } catch (InterruptedException e) { > e.printStackTrace(); > } > {code} > 2: set time out in config > {code:java} > conf.set("hbase.rpc.timeout","2000"); > conf.set("hbase.client.operation.timeout","6000"); > {code} > 3: call batch api > {code:java} > Table table = HbaseUtil.getTable("test"); > byte[] cf = Bytes.toBytes("f"); > byte[] c = Bytes.toBytes("c1"); > List gets = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > byte[] rk = Bytes.toBytes("rk-" + i); > Get get = new Get(rk); > get.addColumn(cf, c); > gets.add(get); > } > Result[] results = new Result[gets.size()]; > table.batch(gets, results); > {code} > The log will looks like below: > {code:java} > [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - > id=1 error for test processing localhost,16020,1624343786295 > java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) > at > org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) > at java.util.concurrent.FutureTask.run(FutureTask.java) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to > finish on table: > [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to > finish on table: test > [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to > finish on table: test > [INFO ]
[jira] [Comment Edited] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369449#comment-17369449 ] Zheng Wang edited comment on HBASE-26027 at 6/25/21, 12:59 PM: --- {quote}Don't really get this: {quote}If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason {quote} how to reproduce it? {color:#172b4d}Added in description, thanks for the comment.{color} {quote} was (Author: filtertip): {quote}Don't really get this: {quote}If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason {quote} how to reproduce it? {quote} Added in description, thanks for the comment.[~reidchan] > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > > The batch API of HTable takes a param named results to store results or > exceptions; its type is Object[]. > If the user passes an array of another type, e.g. > org.apache.hadoop.hbase.client.Result, and we need to put an exception into > it for some reason, an ArrayStoreException will occur in > AsyncRequestFutureImpl.updateResult, AsyncRequestFutureImpl.decActionCounter > will be skipped, and AsyncRequestFutureImpl.waitUntilDone will get stuck > checking actionsInProgress again and again, forever. > It is better to add a cutoff calculated from operationTimeout, instead of > depending only on the value of actionsInProgress. > BTW, this issue is only for 2.x; in 3.x the implementation has been refactored. 
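The failure mode described in the report boils down to Java array covariance: a Result[] passed where Object[] is expected rejects any non-Result element at runtime. A minimal, HBase-free sketch of that mechanism (the Result and DoNotRetryIOException classes below are local stand-ins, not the real HBase types):

```java
// Standalone sketch of the ArrayStoreException described above:
// a typed array referenced through Object[] rejects other element types at runtime.
public class ArrayStoreDemo {
    static class Result {}                                  // stand-in for o.a.h.hbase.client.Result
    static class DoNotRetryIOException extends RuntimeException {}

    public static void main(String[] args) {
        Object[] results = new Result[1];                   // caller passed Result[] as Object[]
        boolean thrown = false;
        try {
            // what updateResult effectively attempts on error: store an exception object
            results[0] = new DoNotRetryIOException();
        } catch (ArrayStoreException e) {
            // the store fails here, so code after it (decActionCounter) never runs
            thrown = true;
        }
        System.out.println("ArrayStoreException thrown: " + thrown);
    }
}
```

Because the compiler cannot catch this (the assignment is type-correct against Object[]), the check only fires at runtime, which is why the hang surfaces deep inside the client rather than at the batch() call site.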
[jira] [Commented] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369449#comment-17369449 ] Zheng Wang commented on HBASE-26027: {quote}Don't really get this: {quote}If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason {quote} how to reproduce it? {quote} Added in description, thanks for the comment.[~reidchan] > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major 
[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26027: --- Environment: (was: {code:java} {code}) > The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone > caused by ArrayStoreException > - > > Key: HBASE-26027 > URL: https://issues.apache.org/jira/browse/HBASE-26027 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 2.2.7, 2.3.5, 2.4.4 >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major 
[jira] [Updated] (HBASE-26027) The calling of HTable.batch blocked at AsyncRequestFutureImpl.waitUntilDone caused by ArrayStoreException
[ https://issues.apache.org/jira/browse/HBASE-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26027: --- Description: The batch API of HTable takes a param named results to store results or exceptions; its type is Object[]. If the user passes an array of another type, e.g. org.apache.hadoop.hbase.client.Result, and we need to put an exception into it for some reason, an ArrayStoreException will occur in AsyncRequestFutureImpl.updateResult, AsyncRequestFutureImpl.decActionCounter will be skipped, and AsyncRequestFutureImpl.waitUntilDone will get stuck checking actionsInProgress again and again, forever. It is better to add a cutoff calculated from operationTimeout, instead of depending only on the value of actionsInProgress. BTW, this issue is only for 2.x; in 3.x the implementation has been refactored. How to reproduce: 1: add sleep in RSRpcServices.multi to mock a slow response {code:java} try { Thread.sleep(2000); } catch (InterruptedException e) { e.printStackTrace(); } {code} 2: set timeouts in config {code:java} conf.set("hbase.rpc.timeout","2000"); conf.set("hbase.client.operation.timeout","6000"); {code} 3: call the batch API {code:java} Table table = HbaseUtil.getTable("test"); byte[] cf = Bytes.toBytes("f"); byte[] c = Bytes.toBytes("c1"); List<Get> gets = new ArrayList<>(); for (int i = 0; i < 10; i++) { byte[] rk = Bytes.toBytes("rk-" + i); Get get = new Get(rk); get.addColumn(cf, c); gets.add(get); } Result[] results = new Result[gets.size()]; table.batch(gets, results); {code} The log will look like below: {code:java} [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - id=1 error for test processing localhost,16020,1624343786295 java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.updateResult(AsyncRequestFutureImpl.java:1242) at 
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.trySetResultSimple(AsyncRequestFutureImpl.java:1087) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.setError(AsyncRequestFutureImpl.java:1021) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.manageError(AsyncRequestFutureImpl.java:683) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.receiveGlobalFailure(AsyncRequestFutureImpl.java:716) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.access$1500(AsyncRequestFutureImpl.java:69) at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncRequestFutureImpl.java:219) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) at java.util.concurrent.FutureTask.run(FutureTask.java) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) [INFO ] [2021/06/22 23:23:10,375] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:23:20,378] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:23:30,384] main - #1, waiting for 10 actions to finish on table: [INFO ] [2021/06/22 23:23:40,387] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:23:50,397] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:24:00,400] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:24:10,408] main - #1, waiting for 10 actions to finish on table: test [INFO ] [2021/06/22 23:24:20,413] main - #1, waiting for 10 actions to finish on table: test {code} was: The batch api of HTable contains a param named results to store result or exception, its type is Object[]. 
If user pass an array with other type, eg: org.apache.hadoop.hbase.client.Result, and we need to put an exception into it by some reason, then the ArrayStoreException will occur in AsyncRequestFutureImpl.updateResult, then the AsyncRequestFutureImpl.decActionCounter will be skipped, then in the AsyncRequestFutureImpl.waitUntilDone we will stuck at here checking the actionsInProgress again and again, forever. It is better to add an cutoff calculated by operationTimeout, instead of only depend on the value of actionsInProgress. BTW, this issue only for 2.x, since 3.x the implement has refactored. {code:java} [ERROR] [2021/06/22 23:23:00,676] hconnection-0x6b927fb-shared-pool3-t1 - id=1 error for test processing localhost,16020,1624343786295 java.lang.ArrayStoreException: org.apache.hadoop.hbase.DoNotRetryIOException at
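The cutoff the description proposes can be sketched as a deadline-bounded wait loop: give up when operationTimeout elapses rather than waiting solely for actionsInProgress to reach zero. This is an illustrative sketch, not the actual HBase patch; the field and method names only loosely mirror AsyncRequestFutureImpl, and the real code waits on a monitor rather than sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of a bounded waitUntilDone: an operationTimeout cutoff prevents
// the infinite loop when the in-progress counter is never decremented.
public class BoundedWaitSketch {
    private final AtomicLong actionsInProgress = new AtomicLong(10);

    boolean waitUntilDone(long operationTimeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(operationTimeoutMs);
        while (actionsInProgress.get() > 0) {
            if (System.nanoTime() >= deadline) {
                return false;        // cutoff reached: report failure instead of hanging forever
            }
            Thread.sleep(10);        // stand-in for waiting on the progress monitor
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedWaitSketch s = new BoundedWaitSketch();
        // actionsInProgress never decrements (as when ArrayStoreException skips
        // decActionCounter), yet the call now returns instead of spinning forever.
        boolean done = s.waitUntilDone(100);
        System.out.println("done=" + done);
    }
}
```

With the cutoff in place the caller of batch() gets a timeout error it can handle, instead of the repeating "waiting for 10 actions to finish" log shown above.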
[jira] [Updated] (HBASE-26028) The view as json page shows exception when using TinyLfuBlockCache
[ https://issues.apache.org/jira/browse/HBASE-26028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zheng Wang updated HBASE-26028: --- Attachment: HBASE-26028-beforepatch.jpg HBASE-26028-afterpatch.jpg > The view as json page shows exception when using TinyLfuBlockCache > -- > > Key: HBASE-26028 > URL: https://issues.apache.org/jira/browse/HBASE-26028 > Project: HBase > Issue Type: Bug > Components: UI >Reporter: Zheng Wang >Assignee: Zheng Wang >Priority: Major > Attachments: HBASE-26028-afterpatch.jpg, HBASE-26028-beforepatch.jpg > > > Some variables in TinyLfuBlockCache should be marked as transient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
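The proposed fix follows the standard pattern for reflection-based serializers: non-serializable internals get the transient modifier so a JSON dump skips them. A self-contained sketch of why that works (the Cache class and its fields below are invented for illustration, not the actual TinyLfuBlockCache members):

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Sketch: a reflection-based field dump (like the "view as json" page's mapper)
// skips fields carrying the transient modifier, so marking cache internals
// transient keeps non-serializable objects out of the JSON output.
public class TransientDemo {
    static class Cache {
        long hitCount = 42;                       // plain stat: safe to serialize
        transient Thread evictionThread = null;   // internal machinery: excluded
    }

    public static void main(String[] args) {
        StringBuilder json = new StringBuilder("{");
        for (Field f : Cache.class.getDeclaredFields()) {
            if (Modifier.isTransient(f.getModifiers())) {
                continue;                         // the mapper never touches this field
            }
            json.append('"').append(f.getName()).append("\":...");
        }
        json.append('}');
        System.out.println(json);
    }
}
```

Only hitCount appears in the output; the transient evictionThread is never visited, which is the behavior the before/after screenshots attached to the issue are meant to show.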