[jira] [Created] (HBASE-28569) Race condition during WAL splitting leading to corrupt recovered.edits
Benoit Sigoure created HBASE-28569: -- Summary: Race condition during WAL splitting leading to corrupt recovered.edits Key: HBASE-28569 URL: https://issues.apache.org/jira/browse/HBASE-28569 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 2.4.17 Reporter: Benoit Sigoure

There is a race condition that can occur when a regionserver aborts initialisation while splitting a WAL from another regionserver. The race leads to the WAL trailer for the recovered edits being written while the writer threads are still running, so the trailer gets interleaved with the edits, corrupting the recovered.edits file (and preventing the region from being assigned). We've seen this happen on HBase 2.4.17, but looking at the latest code it seems the race can still happen there.

The sequence of operations that leads to this issue:
* {{org.apache.hadoop.hbase.wal.WALSplitter.splitWAL}} calls {{outputSink.close()}} after adding all the entries to the buffers.
* The output sink is {{org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink}}; its {{close}} method first calls {{finishWriterThreads}} in a try block, which in turn calls {{finish}} on every writer thread and then joins it to make sure it's done.
* However, if the splitter thread gets interrupted because the RS is aborting, the join is interrupted and {{finishWriterThreads}} rethrows without waiting for the writer threads to stop.
* This is problematic because, back in {{org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink.close}}, {{closeWriters}} is called in a finally block (so it executes even when the join was interrupted).
* {{closeWriters}} calls {{org.apache.hadoop.hbase.wal.AbstractRecoveredEditsOutputSink.closeRecoveredEditsWriter}}, which calls {{close}} on {{editWriter.writer}}.
* When {{editWriter.writer}} is {{org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter}}, its {{close}} method writes the trailer before closing the file.
* The trailer write therefore runs in parallel with the writer threads writing entries, causing corruption.
* If there are no other errors, {{closeWriters}} succeeds in renaming all the temporary files to final recovered edits, causing problems the next time the region is assigned.

Log evidence supporting the above flow. The abort is triggered (because the RS failed to open the WAL due to an ongoing infra issue):
{noformat} regionserver-2 regionserver 06:22:00.384 [RS_OPEN_META-regionserver/host01:16201-0] ERROR org.apache.hadoop.hbase.regionserver.HRegionServer - * ABORTING region server host01,16201,1709187641249: WAL can not clean up after init failed *{noformat}
We can see that the writer threads were still active after closing (even allowing that the ordering in the log might not be accurate, they die because the channel is closed while they are still writing, not because they are stopping):
{noformat} regionserver-2 regionserver 06:22:09.662 [DataStreamer for file /hbase/data/default/aeris_v2/53308260a6b22eaf6ebb8353f7df3077/recovered.edits/03169600719-host02%2C16201%2C1709180140645.1709186722780.temp block BP-1645452845-192.168.2.230-1615455682886:blk_1076340939_2645368] WARN org.apache.hadoop.hdfs.DataStreamer - Error Recovery for BP-1645452845-192.168.2.230-1615455682886:blk_1076340939_2645368 in pipeline [DatanodeInfoWithStorage[192.168.2.230:15010,DS-2aa201ab-1027-47ec-b05f-b39d795fda85,DISK], DatanodeInfoWithStorage[192.168.2.232:15010,DS-39651d5a-67d2-4126-88f0-45cdee967dab,DISK], DatanodeInfoWithStorage[192.168.2.231:15010,DS-e08a1d17-f7b1-4e39-9713-9706bd762f48,DISK]]: datanode 2(DatanodeInfoWithStorage[192.168.2.231:15010,DS-e08a1d17-f7b1-4e39-9713-9706bd762f48,DISK]) is bad.
regionserver-2 regionserver 06:22:09.742 [split-log-closeStream-pool-1] INFO org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink - Closed recovered edits writer path=hdfs://mycluster/hbase/data/default/aeris_v2/53308260a6b22eaf6ebb8353f7df3077/recovered.edits/03169600719-host02%2C16201%2C1709180140645.1709186722780.temp (wrote 5949 edits, skipped 0 edits in 93 ms)
regionserver-2 regionserver 06:22:09.743 [RS_LOG_REPLAY_OPS-regionserver/host01:16201-1-Writer-0] ERROR org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink - Failed to write log entry aeris_v2/53308260a6b22eaf6ebb8353f7df3077/3169611655=[#edits: 8 = ] to log
regionserver-2 regionserver java.nio.channels.ClosedChannelException: null
regionserver-2 regionserver at org.apache.hadoop.hdfs.ExceptionLastSeen.throwException4Close(ExceptionLastSeen.java:73) ~[hadoop-hdfs-client-3.2.4.jar:?]
regionserver-2 regionserver at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:153) ~[hadoop-hdfs-client-3.2.4.jar:?]
regionserver-2 regionserver at
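The control flow above can be reduced to a small, self-contained simulation. All class, method, and field names here are invented for illustration; the real code paths are {{WALSplitter.splitWAL}}, {{RecoveredEditsOutputSink.close}} and {{ProtobufLogWriter.close}}. The point it demonstrates: an interrupted join in the try block does not stop the writer thread, and the finally block writes the trailer anyway.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical simulation of the race: "close" joins the writer thread in a
// try block, the join is interrupted (RS abort), and the finally block still
// writes the trailer while the writer keeps appending edits.
public class TrailerRaceSketch {
    static List<String> run() throws InterruptedException {
        final List<String> file = Collections.synchronizedList(new ArrayList<String>());
        final CountDownLatch started = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            started.countDown();
            for (int i = 0; i < 5; i++) {
                file.add("edit-" + i);  // writer thread still appending edits
                try { Thread.sleep(20); } catch (InterruptedException e) { return; }
            }
        });
        writer.start();
        started.await();
        try {
            Thread.currentThread().interrupt();  // simulate the pending RS-abort interrupt
            writer.join();                       // throws immediately, like finishWriterThreads' join
        } catch (InterruptedException rethrownByFinishWriterThreads) {
            // finishWriterThreads rethrows here without waiting for the writer to stop
        } finally {
            file.add("TRAILER");  // closeWriters -> ProtobufLogWriter.close writes the trailer
        }
        writer.join();  // let the writer finish so we can inspect the "file"
        return file;
    }

    public static void main(String[] args) throws InterruptedException {
        // The trailer typically lands before the last edits, i.e. interleaved.
        System.out.println(run());
    }
}
```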
[jira] [Updated] (HBASE-28569) Race condition during WAL splitting leading to corrupt recovered.edits
[ https://issues.apache.org/jira/browse/HBASE-28569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-28569: --- Description: (same description and log excerpts as in the creation notice above, with a few additional frames of the ClosedChannelException stack trace appended to the log excerpt)
[jira] [Commented] (HBASE-27696) [hbase-operator-tools] Use $revision as placeholder for maven version
[ https://issues.apache.org/jira/browse/HBASE-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705733#comment-17705733 ] Benoit Sigoure commented on HBASE-27696: Hi guys, any plans to cut a release any time soon? > [hbase-operator-tools] Use $revision as placeholder for maven version > - > > Key: HBASE-27696 > URL: https://issues.apache.org/jira/browse/HBASE-27696 > Project: HBase > Issue Type: Task > Components: build, pom >Affects Versions: hbase-operator-tools-1.3.0 >Reporter: Nick Dimiduk >Assignee: Nick Dimiduk >Priority: Major > Fix For: hbase-operator-tools-1.3.0 > > > To align with our main repo. -- This message was sent by Atlassian Jira (v8.20.10#820010)
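For context, the {{$revision}} in the subject refers to Maven's "CI-friendly versions" mechanism: the pom declares its version as the {{${revision}}} property so a CI pipeline can override it on the command line. A minimal sketch of the pattern follows; the coordinates and version numbers are illustrative, not copied from the actual hbase-operator-tools pom.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>operator-tools-sketch</artifactId>
  <!-- Single source of truth for the version; CI can override it with
       e.g. `mvn -Drevision=1.3.0 deploy` without editing any pom. -->
  <version>${revision}</version>
  <properties>
    <revision>1.3.0-SNAPSHOT</revision>
  </properties>
</project>
```

Child modules then reference the parent with {{<version>${revision}</version>}} as well, so a release only changes one property.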
[jira] [Created] (HBASE-27357) Online schema migration causes the master to report a very high or negative number of QPS
Benoit Sigoure created HBASE-27357: -- Summary: Online schema migration causes the master to report a very high or negative number of QPS Key: HBASE-27357 URL: https://issues.apache.org/jira/browse/HBASE-27357 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.4.12 Environment: |JVM Version|Oracle Corporation 11.0.15-11.0.15+10| Reporter: Benoit Sigoure Attachments: Screen Shot 2022-09-05 at 18.31.00.png, Screen Shot 2022-09-05 at 18.31.06.png We've seen this a few times when making an online schema change, e.g.: {code:java} alter 'foo', {NAME=>'e',VERSIONS=>'2147483646'} {code} This causes the master to briefly show extremely high QPS per regionserver, and probably causes an integer overflow in the sum. !Screen Shot 2022-09-05 at 18.31.00.png|width=859,height=323! ... !Screen Shot 2022-09-05 at 18.31.06.png|width=856,height=322! This could be related to the issue reported in HBASE-27242.
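The master's metrics code isn't quoted here, but a briefly huge and then negative reading is the classic signature of summing large counters into a 32-bit {{int}}. A minimal, self-contained illustration (the values are hypothetical, not taken from the screenshots):

```java
// Summing two large per-regionserver counters into an int wraps around to a
// negative number; widening to long before adding avoids the overflow.
public class OverflowSketch {
    public static void main(String[] args) {
        int a = 2_000_000_000;          // e.g. one regionserver's request counter
        int b = 2_000_000_000;          // another regionserver's counter
        int narrowSum = a + b;          // wraps: prints -294967296
        long wideSum = (long) a + b;    // widened first: prints 4000000000
        System.out.println(narrowSum);
        System.out.println(wideSum);
    }
}
```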
[jira] [Updated] (HBASE-27242) HBase master reports a region has been in transition for 50+ years
[ https://issues.apache.org/jira/browse/HBASE-27242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-27242: --- Description: Every time we upgrade our HBase clusters we get some spurious alerts firing because for a brief period of time the HBase master reports some impossibly high RIT (region in transition) time. For example: !image.png|width=747,height=249! The condition resolves itself on its own within a few minutes. I'll try to find some relevant logs and attach them to this issue. was: Every time we upgrade our HBase clusters we get some spurious alerts firing because for a brief period of time the HBase master reports some impossibly high RIT (region in transition) time. For example: !image.png! The condition resolves itself on its own within a few minutes. I'll try to find some relevant logs and attach them to this issue. > HBase master reports a region has been in transition for 50+ years > -- > > Key: HBASE-27242 > URL: https://issues.apache.org/jira/browse/HBASE-27242 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 2.4.8, 2.4.9, 2.4.10, 2.4.11 > Environment: openjdk version "11.0.14.1" 2022-02-08 > OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1) > OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing) > Reporter: Benoit Sigoure > Priority: Major > Attachments: image.png
[jira] [Created] (HBASE-27242) HBase master reports a region has been in transition for 50+ years
Benoit Sigoure created HBASE-27242: -- Summary: HBase master reports a region has been in transition for 50+ years Key: HBASE-27242 URL: https://issues.apache.org/jira/browse/HBASE-27242 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.4.11, 2.4.10, 2.4.9, 2.4.8 Environment: openjdk version "11.0.14.1" 2022-02-08 OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1) OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing) Reporter: Benoit Sigoure Attachments: image.png Every time we upgrade our HBase clusters we get some spurious alerts firing because for a brief period of time the HBase master reports some impossibly high RIT (region in transition) time. For example: !image.png! The condition resolves itself on its own within a few minutes. I'll try to find some relevant logs and attach them to this issue.
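The report doesn't identify the root cause, but "50+ years" is exactly the figure you get if a transition's start timestamp is ever read as 0 (the Unix epoch), for example from a field that isn't populated yet while the master restarts. A hypothetical illustration, not the actual master code:

```java
import java.time.Duration;
import java.time.Instant;

// If a region's transition start time is taken as 0 (the Unix epoch), the
// computed time-in-transition is the full age of the epoch: 50+ years.
public class RitAgeSketch {
    static long yearsInTransition(long startTimeMillis, long nowMillis) {
        return Duration.ofMillis(nowMillis - startTimeMillis).toDays() / 365;
    }

    public static void main(String[] args) {
        long now = Instant.parse("2022-07-01T00:00:00Z").toEpochMilli();
        System.out.println(yearsInTransition(0L, now));  // prints 52
    }
}
```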
[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed'
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505128#comment-17505128 ] Benoit Sigoure commented on HBASE-26042: For some reason Mike can't upload files (maybe new accounts aren't immediately allowed to upload attachments?), anyway, I just posted a heap dump along with the thread dump that was taken at the ~same time. > WAL lockup on 'sync failed' > --- > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5, 2.4.8 >Reporter: Michael Stack >Priority: Major > Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, > hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2, > regionserver-heap-live.hprof.gz, regionserver-threaddump.log > > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > What's supposed to happen when the other side goes away like this is that we will > roll the WAL – go set up a new one. 
You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], > org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) > @bci=9, line=202 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, > org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, > long) @bci=107, line=170 (Compiled frame) > - > org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, > org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, > org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) > @bci=61, line=113 
(Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=22, line=651 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=2, line=128 (Compiled frame) > - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) > @bci=101, line=797 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) > @bci=18, line=263 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 > (Compiled frame) {code} > > Other threads are BLOCKED trying to append the WAL w/ flush markers etc. > unable to add the ringbuffer: > > {code:java} > Thread 9465: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 > (Compiled frame) > -
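The roller stack above is parked in {{CompletableFuture.get()}} with no timeout while writing the WAL header, so if the write's future is never completed the roller waits forever. A minimal sketch of the difference between the unbounded and bounded waits; the future here is a stand-in that never completes, not HBase code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// get() with no arguments parks the caller indefinitely; get(timeout, unit)
// fails fast with TimeoutException, which at least surfaces the hang.
public class UnboundedGetSketch {
    public static void main(String[] args) throws Exception {
        CompletableFuture<Long> writeFuture = new CompletableFuture<>();
        // writeFuture.get();  // unbounded: would park this thread until completion
        try {
            writeFuture.get(100, TimeUnit.MILLISECONDS);  // bounded wait
        } catch (TimeoutException e) {
            System.out.println("timed out instead of hanging");
        }
    }
}
```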
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed'
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-26042: --- Attachment: regionserver-heap-live.hprof.gz
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed'
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-26042: --- Attachment: regionserver-threaddump.log
[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed'
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505123#comment-17505123 ] Benoit Sigoure commented on HBASE-26042: Hi Andrew, thanks for your reply. I already attached the regionserver logs as well as the stack trace {{/dump}} from the servlet. Mike is going to post a heap dump soon. We've been seeing quite a few instances of this bug lately, I think a number of the "HBase is stuck" kinda reports I've heard about over the past year or so were likely due to this bug. We are able to reproduce it relatively easily by taking a cluster and killing nodes randomly. > WAL lockup on 'sync failed' > --- > > Key: HBASE-26042 > URL: https://issues.apache.org/jira/browse/HBASE-26042 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.5, 2.4.8 >Reporter: Michael Stack >Priority: Major > Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, > hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2 > > > Making note of issue seen in production cluster. > Node had been struggling under load for a few days with slow syncs up to 10 > seconds, a few STUCK MVCCs from which it recovered and some java pauses up to > three seconds in length. > Then the below happened: > {code:java} > 2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] > wal.AsyncFSWAL: sync > failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > readAddress(..) failed: Connection reset by peer {code} > ... and WAL turned dead in the water. Scanners start expiring. RPC prints > text versions of requests complaining requestsTooSlow. Then we start to see > these: > {code:java} > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync > result after 30 ms for txid=552128301, WAL system stuck? {code} > Whats supposed to happen when other side goes away like this is that we will > roll the WAL – go set up a new one. 
You can see it happening if you run > {code:java} > mvn test > -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter > {code} > I tried hacking the test to repro the above hang by throwing same exception > in above test (on linux because need epoll to repro) but all just worked. > Thread dumps of the hungup WAL subsystem are a little odd. The log roller is > stuck w/o timeout trying to write a long on the WAL header: > > {code:java} > Thread 9464: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, > line=1707 (Compiled frame) > - > java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) > @bci=119, line=3323 (Compiled frame) > - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, > line=1742 (Compiled frame) > - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled > frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) > @bci=16, line=189 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], > org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) > @bci=9, line=202 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, > org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, > long) @bci=107, line=170 (Compiled frame) > - > org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, > org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, > org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) > @bci=61, line=113 
(Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=22, line=651 (Compiled frame) > - > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) > @bci=2, line=128 (Compiled frame) > - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) > @bci=101, line=797 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) > @bci=18, line=263 (Compiled frame) > - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 > (Compiled frame) {code} > > Other threads are BLOCKED trying to append the WAL w/ flush markers etc. > unable to add the ringbuffer: > > {code:java} > Thread 9465: (state = BLOCKED) > -
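The thread-dump frames above boil down to an unbounded wait: the header write in {{AsyncProtobufLogWriter.write(...)}} returns a future that the log roller waits on with a plain {{CompletableFuture.get()}}, so if the peer is already gone the ack never arrives and the roll never completes. A minimal sketch of that failure mode follows; the {{writeHeader}} stand-in is hypothetical, not actual HBase code, and only illustrates why a bounded wait would surface the failure instead of hanging:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class WalHeaderHangSketch {
    // Hypothetical stand-in for the future returned by the WAL header
    // write: it completes only when the channel acks the bytes. A dead
    // peer means it may never complete at all.
    static CompletableFuture<Long> writeHeader() {
        return new CompletableFuture<>(); // never completed: simulates a dead channel
    }

    // A plain get() here reproduces the stuck log-roller frame from the
    // thread dump (parked forever in CompletableFuture.waitingGet). A
    // bounded get() turns the hang into a visible timeout.
    static String awaitHeaderAck(long millis) {
        try {
            writeHeader().get(millis, TimeUnit.MILLISECONDS);
            return "ok";
        } catch (TimeoutException e) {
            return "timeout";
        } catch (Exception e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitHeaderAck(100)); // prints "timeout"
    }
}
```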
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed'
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-26042: --- Summary: WAL lockup on 'sync failed' (was: WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer)
[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502810#comment-17502810 ] Benoit Sigoure commented on HBASE-26042: We've run into this issue on a test cluster with HBase 2.4.8. Let me know if I can collect anything else to help you, as things are still stuck right now and we can probably keep it untouched for another day or two as it's a test cluster.
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-26042: --- Affects Version/s: 2.4.8
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-26042: --- Attachment: debug-dump.txt
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-26042: --- Attachment: hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log
[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468258#comment-17468258 ] Benoit Sigoure commented on HBASE-21476: Uploaded a patch for use against HBase version 2.4.9 > Support for nanosecond timestamps > - > > Key: HBASE-21476 > URL: https://issues.apache.org/jira/browse/HBASE-21476 > Project: HBase > Issue Type: New Feature >Affects Versions: 2.1.1 >Reporter: Andrey Elenskiy >Assignee: Andrey Elenskiy >Priority: Major > Labels: features, patch > Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, > HBASE-21476.branch-2.1.0003.patch, HBASE-21476.branch-2.1.0004.patch, > HBASE-21476.branch-2.4.001.patch, nanosecond_timestamps_v1.patch, > nanosecond_timestamps_v2.patch > > > Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to > handle timestamps with nanosecond precision. This is useful for applications > that timestamp updates at the source with nanoseconds and still want features > like column family TTL and "hbase.hstore.time.to.purge.deletes" to work. > The attribute should be specified either on new tables or on existing tables > which have timestamps only with nanosecond precision. There's no migration > from milliseconds to nanoseconds for already existing tables. We could add > this migration as part of compaction if you think that would be useful, but > that would obviously make the change more complex. > I've added a new EnvironmentEdge method "currentTimeNano()" that uses > [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html] > to get time in nanoseconds which means it will only work with Java 8. The > idea is to gradually replace all places where "EnvironmentEdge.currentTime()" > is used to have HBase working purely with nanoseconds (which is a > prerequisite for HBASE-14070). 
Also, I've refactored ScanInfo and > PartitionedMobCompactor to expect TableDescriptor as an argument which makes > code a little cleaner and easier to extend. > Couple more points: > - column family TTL (specified in seconds) and > "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options > don't need to be changed, those are adjusted automatically. > - Per cell TTL needs to be scaled by clients accordingly after > "NANOSECOND_TIMESTAMPS" table attribute is specified. > Looking for everyone's feedback to know if that's a worthwhile direction. > Will add more comprehensive tests in a later patch. -- This message was sent by Atlassian Jira (v8.20.1#820001)
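As a rough illustration of the {{currentTimeNano()}} idea described above (a sketch of the approach, not the patch itself): {{java.time.Instant}} exposes epoch seconds plus a nanosecond-of-second field, which can be combined into a single epoch-nanosecond timestamp. Note that the clock's actual resolution is platform-dependent and is often only microseconds or milliseconds even though the unit is nanoseconds.

```java
import java.time.Instant;

public class NanoTimeSketch {
    // Combine Instant's two components into one epoch-nanosecond value.
    // Longs overflow epoch nanoseconds around the year 2262, which is
    // acceptable for timestamp use.
    static long currentTimeNano() {
        Instant now = Instant.now();
        return now.getEpochSecond() * 1_000_000_000L + now.getNano();
    }

    public static void main(String[] args) {
        long nanos = currentTimeNano();
        // Scaling back down should agree with the millisecond clock.
        System.out.println(nanos / 1_000_000L);
    }
}
```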
[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps
[ https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-21476: --- Attachment: HBASE-21476.branch-2.4.001.patch
[jira] [Created] (HBASE-26383) HBCK incorrectly reports inconsistencies for recently split regions following a master failover
Benoit Sigoure created HBASE-26383: -- Summary: HBCK incorrectly reports inconsistencies for recently split regions following a master failover Key: HBASE-26383 URL: https://issues.apache.org/jira/browse/HBASE-26383 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.4.3 Reporter: Benoit Sigoure When a region P splits into A and B, following a master failover the newly active master reports that P is in an inconsistent state. This seems to be a regression introduced in HBASE-25847 (cc [~andrew.purt...@gmail.com]) which changed {{regionInfo.isParentSplit()}} to {{regionState.isSplit()}}. The region state after restart is CLOSED (rather than SPLIT), so both region state and region info should be checked, presumably with {{regionState.isSplit() || regionInfo.isSplit()}}. This situation resolves itself on its own when a major compaction occurs and P is GCed, but having the master incorrectly report inconsistencies is pretty bad. We had a pretty big outage due to a series of operator errors as our SRE team was trying to fix this inconsistency that, in fact, didn't even exist. Thanks to Stack for helping look over this issue and Vlad Hanciuta for root causing the bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
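The proposed fix is small enough to sketch. The {{RegionState}}/{{RegionInfo}} classes below are hypothetical minimal stand-ins, not the real HBase types; the point is only the combined check:

```java
public class SplitCheckSketch {
    // Minimal stand-ins for the master's two views of a region: the
    // in-memory assignment state and the persisted region info.
    static class RegionState {
        final boolean split;
        RegionState(boolean split) { this.split = split; }
        boolean isSplit() { return split; }
    }
    static class RegionInfo {
        final boolean split;
        RegionInfo(boolean split) { this.split = split; }
        boolean isSplit() { return split; }
    }

    // Consulting regionState alone misfires after a master failover,
    // where the parent's state is CLOSED rather than SPLIT; checking
    // both sources avoids the false inconsistency report.
    static boolean parentIsSplit(RegionState state, RegionInfo info) {
        return state.isSplit() || info.isSplit();
    }

    public static void main(String[] args) {
        // Post-failover scenario: state says CLOSED, region info says split.
        System.out.println(parentIsSplit(new RegionState(false), new RegionInfo(true)));
    }
}
```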
[jira] [Commented] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"
[ https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478183#comment-16478183 ] Benoit Sigoure commented on HBASE-20463: You're right, my bad. Once I downgraded to JDK8 things started working again. Thanks! > Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL > change and document" > -- > > Key: HBASE-20463 > URL: https://issues.apache.org/jira/browse/HBASE-20463 > Project: HBase > Issue Type: Bug > Components: shell >Reporter: stack >Assignee: Sean Busbey >Priority: Blocker > Fix For: 1.5.0, 1.4.4 > > Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch > > > Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] > (and [~apurtell]). > See parent for discussion on breakage. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"
[ https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476499#comment-16476499 ] Benoit Sigoure edited comment on HBASE-20463 at 5/15/18 9:32 PM: - I'm still seeing the error with HBase 1.4.4 (I'm using the binary release [here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz]) {code} foo@cc2a495eedfe:~$ java -version openjdk version "9.0.4" OpenJDK Runtime Environment (build 9.0.4+12-Debian-4) OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code} {code} {code} foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release. WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker (file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method java.lang.Object.registerNatives() WARNING: Please consider reporting this to the maintainers of org.jruby.java.invokers.RubyToJavaInvoker WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release ArgumentError: wrong number of arguments (0 for 1) method_added at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10 method_added at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129 Pattern at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2 (root) at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1 require at org/jruby/RubyKernel.java:1062 (root) at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42 (root) at /home/foo/hbase/bin/../bin/hirb.rb:38 {code} was (Author: tsuna): I'm still seeing the error with HBase 1.4.4 (I'm 
using the binary release [here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz]) {code:java} foo@cc2a495eedfe:~$ java -version openjdk version "9.0.4" OpenJDK Runtime Environment (build 9.0.4+12-Debian-4) OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code} {code:java} {code:java} foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release. WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker (file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method java.lang.Object.registerNatives() WARNING: Please consider reporting this to the maintainers of org.jruby.java.invokers.RubyToJavaInvoker WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release ArgumentError: wrong number of arguments (0 for 1) method_added at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10 method_added at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129 Pattern at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2 (root) at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1 require at org/jruby/RubyKernel.java:1062 (root) at file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42 (root) at /home/foo/hbase/bin/../bin/hirb.rb:38 {code} > Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL > change and document" > -- > > Key: HBASE-20463 > URL: https://issues.apache.org/jira/browse/HBASE-20463 > Project: HBase > Issue Type: Bug > Components: shell >Reporter: stack >Assignee: Sean Busbey >Priority: Blocker > Fix For: 1.5.0, 
1.4.4 > > Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch > > > Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] > (and [~apurtell]). > See parent for discussion on breakage. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-18372) Potential infinite busy loop in HMaster's ProcedureExecutor
Benoit Sigoure created HBASE-18372: -- Summary: Potential infinite busy loop in HMaster's ProcedureExecutor Key: HBASE-18372 URL: https://issues.apache.org/jira/browse/HBASE-18372 Project: HBase Issue Type: Bug Components: master Affects Versions: 1.3.1 Environment: Kernel 3.10.0-327.10.1.el7.x86_64 JVM 1.8.0_102 Reporter: Benoit Sigoure While investigating an issue today with [~timoha] we saw the HMaster consistently burning 1.5 cores of CPU cycles. Upon looking more closely, it was actually all 8 threads of the {{ProcedureExecutor}} thread pool each constantly taking ~15% of a CPU core (I identified this by looking at individual threads in {{top}} and cross-referencing the thread IDs with the thread IDs in a JVM stack trace). The HMaster log or output didn't contain anything suspicious and it was hard for us to ascertain what exactly was happening. It just looked like these threads were regularly spinning, doing nothing. We just saw a lot of {{futex}} system calls happening all the time, and all the threads of the thread pool regularly taking turns in waking up and going back to sleep. My reading of the code in {{procedure2/ProcedureExecutor.java}} is that this can happen if the threads in the thread pool have been interrupted for some reason: {code} private void execLoop() { while (isRunning()) { Procedure proc = runnables.poll(); if (proc == null) continue; {code} and then in {{master/procedure/MasterProcedureScheduler.java}}: {code} @Override public Procedure poll() { return poll(-1); } @edu.umd.cs.findbugs.annotations.SuppressWarnings("WA_AWAIT_NOT_IN_LOOP") Procedure poll(long waitNsec) { Procedure pollResult = null; schedLock.lock(); try { if (queueSize == 0) { if (waitNsec < 0) { schedWaitCond.await(); [...]
} catch (InterruptedException e) { Thread.currentThread().interrupt(); } finally { schedLock.unlock(); } return pollResult; } {code} so my theory is the threads in the thread pool have all been interrupted (maybe by a procedure that ran earlier and left its thread interrupted) and so we are perpetually looping in {{execLoop}}, which ends up calling {{schedWaitCond.await();}}, which ends up throwing an {{InterruptedException}}, which ends up resetting the interrupt status of the thread, and rinse and repeat. But again I wasn't able to get any cold hard evidence that this is what was happening. There was just no other evidence that could explain this behavior, and I wasn't able to guess what else could be causing this that was consistent with what we saw and what I understood from reading the code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
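The suspected spin can be reproduced in isolation: once a thread's interrupt status is set, {{java.util.concurrent.locks.Condition.await()}} throws {{InterruptedException}} immediately, and restoring the flag in the catch block (as the scheduler's {{poll}} does) makes every subsequent {{await()}} throw immediately as well. A standalone demonstration of the pattern, not HBase code:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Demo: a thread whose interrupt status is restored after every
// InterruptedException never blocks in Condition.await(), so a poll loop
// built on it spins instead of sleeping.
public class InterruptBusyLoop {
    public static void main(String[] args) {
        ReentrantLock schedLock = new ReentrantLock();
        Condition schedWaitCond = schedLock.newCondition();
        int spins = 0;
        Thread.currentThread().interrupt(); // e.g. a procedure left the flag set
        for (int i = 0; i < 5; i++) {       // bounded here; unbounded in execLoop()
            schedLock.lock();
            try {
                schedWaitCond.await();      // throws at once: flag already set
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore flag, as poll() does
                spins++;
            } finally {
                schedLock.unlock();
            }
        }
        Thread.interrupted(); // clear the flag before exiting
        if (spins != 5) throw new AssertionError("expected 5 spins, got " + spins);
        System.out.println("spins=" + spins); // no await ever blocked
    }
}
```

If this theory holds, the loop would stop spinning only if the interrupt status were cleared (or the thread exited) when it is interrupted while {{isRunning()}} is still true.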
[jira] [Commented] (HBASE-18042) Client Compatibility breaks between versions 1.2 and 1.3
[ https://issues.apache.org/jira/browse/HBASE-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011327#comment-16011327 ] Benoit Sigoure commented on HBASE-18042: We can easily update AsyncHBase to accommodate the change; however, I would like to voice disagreement with this statement: {quote} It is an unfortunate thing that we have broken the semantics, but in general this is "allowed". {quote} Such semantic changes are like breaking API changes, they are, well, breaking changes. Not cool. One of the challenges with AsyncHBase is that it has to work with all versions of HBase. Since {{more_results_in_region}} was already there in 1.2 but needs to be handled differently in 1.3, that makes it kinda hard for AsyncHBase to know how, exactly, to deal with this flag being set, right? > Client Compatibility breaks between versions 1.2 and 1.3 > > > Key: HBASE-18042 > URL: https://issues.apache.org/jira/browse/HBASE-18042 > Project: HBase > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Karan Mehta >Assignee: Karan Mehta > > OpenTSDB uses AsyncHBase as its client, rather than using the traditional > HBase Client. From version 1.2 to 1.3, the {{ClientProtos}} have been > changed. Newer fields are added to {{ScanResponse}} proto. > For a typical Scan request in 1.2, would require caller to make an > OpenScanner Request, GetNextRows Request and a CloseScanner Request, based on > {{more_rows}} boolean field in the {{ScanResponse}} proto. > However, from 1.3, new parameter {{more_results_in_region}} was added, which > limits the results per region. Therefore the client has to now manage sending > all the requests for each region. Further more, if the results are exhausted > from a particular region, the {{ScanResponse}} will set > {{more_results_in_region}} to false, but {{more_results}} can still be true. > Whenever the former is set to false, the {{RegionScanner}} will also be > closed.
> OpenTSDB makes an OpenScanner Request and receives all its results in the > first {{ScanResponse}} itself, thus creating a condition as described in > above paragraph. Since {{more_rows}} is true, it will proceed to send next > request at which point the {{RSRpcServices}} will throw > {{UnknownScannerException}}. The protobuf client compatibility is maintained > but expected behavior is modified. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
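One way a cross-version client can cope with the semantics described above is to trust {{more_results_in_region}} only when the server actually populated it. A hedged sketch, with plain booleans standing in for the {{ScanResponse}} protobuf accessors (the real generated API exposes a "has" method for optional fields):

```java
// Sketch: deciding whether the current region's scanner is exhausted while
// tolerating both the 1.2 and 1.3 semantics of ScanResponse.
public class ScanFlags {
    // hasInRegion: the server populated more_results_in_region (1.3+ behavior)
    static boolean regionExhausted(boolean hasInRegion, boolean inRegion) {
        // A 1.2 server never sets the field, so its absence must not be read
        // as "region exhausted"; only an explicit false means the
        // RegionScanner is done (and has been closed server-side).
        return hasInRegion && !inRegion;
    }

    public static void main(String[] args) {
        if (regionExhausted(false, false)) throw new AssertionError(); // 1.2 server
        if (!regionExhausted(true, false)) throw new AssertionError(); // 1.3: region done
        if (regionExhausted(true, true)) throw new AssertionError();   // 1.3: more rows
        System.out.println("ok");
    }
}
```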
[jira] [Commented] (HBASE-17489) ClientScanner may send a next request to a RegionScanner which has been exhausted
[ https://issues.apache.org/jira/browse/HBASE-17489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001480#comment-16001480 ] Benoit Sigoure commented on HBASE-17489: AsyncHBase expects the scanner ID in response to scanning more rows but that's not actually necessary. I think I added this as a sanity check because I expected the server to always return the ID, but as was said above it's technically not strictly necessary for the server to return the ID on subsequent uses of the scanner. The code doesn't even do anything with the scanner ID other than checking that it's the ID we expected: {code} @Override Response deserialize(final ChannelBuffer buf, final int cell_size) { final ScanResponse resp = readProtobuf(buf, ScanResponse.PARSER); final long id = resp.getScannerId(); if (scanner_id != id) { throw new InvalidResponseException("Scan RPC response was for scanner" + " ID " + id + " but we expected" + scanner_id, resp); } final ArrayList rows = getRows(resp, buf, cell_size); if (rows == null) { return null; } return new Response(resp.getScannerId(), rows, resp.getMoreResults()); } {code} I guess we could fix this by saying "if we have a scanner ID in the response THEN it must match the one we expect" instead of "there must be a scanner ID in the response that matches what we expect". Ironically we had the same bug in GoHBase, where we made the same assumption that the scanner ID was always present in the response.
> ClientScanner may send a next request to a RegionScanner which has been > exhausted > - > > Key: HBASE-17489 > URL: https://issues.apache.org/jira/browse/HBASE-17489 > Project: HBase > Issue Type: Bug > Components: Client, scan >Affects Versions: 2.0.0, 1.3.0, 1.4.0 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Critical > Fix For: 2.0.0, 1.4.0, 1.3.1 > > Attachments: HBASE-17489-branch-1.3.patch, > HBASE-17489-branch-1.patch, HBASE-17489.patch, HBASE-17489-v1.patch, > HBASE-17489-v2.patch, HBASE-17489-v3.patch, HBASE-17489-v4.patch, > HBASE-17489-v4.patch, HBASE-17489-v5.patch, HBASE-17489-v6.patch > > > Found it when implementing HBASE-17045. Seems the final result of the scan is > correct but no doubt the logic is broken. We need to fix it to stop things > get worse in the future. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
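The relaxed validation suggested in the comment above can be sketched as follows. This is illustrative only, not the AsyncHBase patch; {{hasId}} stands in for the protobuf accessor that reports whether the response carried a scanner ID:

```java
// Sketch: only compare scanner IDs when the response actually carries one,
// instead of requiring every response to include the ID.
public class ScannerIdCheck {
    static void checkScannerId(boolean hasId, long responseId, long expectedId) {
        if (hasId && responseId != expectedId) {
            throw new IllegalStateException("Scan RPC response was for scanner ID "
                + responseId + " but we expected " + expectedId);
        }
    }

    public static void main(String[] args) {
        checkScannerId(false, 0L, 42L);  // no ID in the response: accepted
        checkScannerId(true, 42L, 42L);  // matching ID: accepted
        boolean threw = false;
        try {
            checkScannerId(true, 7L, 42L); // mismatched ID: still rejected
        } catch (IllegalStateException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError();
        System.out.println("ok");
    }
}
```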
[jira] [Commented] (HBASE-13329) ArrayIndexOutOfBoundsException in CellComparator#getMinimumMidpointArray
[ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627056#comment-14627056 ] Benoit Sigoure commented on HBASE-13329: I'm kinda late to the party but yeah OpenTSDB compactions might cause long column qualifiers. OpenTSDB doesn't generally use long row keys though, so that makes total sense. Thanks for getting to the bottom of this one! ArrayIndexOutOfBoundsException in CellComparator#getMinimumMidpointArray Key: HBASE-13329 URL: https://issues.apache.org/jira/browse/HBASE-13329 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.0.1 Environment: linux-debian-jessie ec2 - t2.micro instances Reporter: Ruben Aguiar Assignee: Lars Hofhansl Priority: Critical Fix For: 2.0.0, 1.0.2, 1.2.0, 1.1.2 Attachments: 13329-asserts.patch, 13329-v1.patch, 13329.txt, HBASE-13329.test.00.branch-1.1.patch While trying to benchmark my opentsdb cluster, I've created a script that sends to hbase always the same value (in this case 1). After a few minutes, the whole region server crashes and the region itself becomes impossible to open again (cannot assign or unassign). After some investigation, what I saw on the logs is that when a Memstore flush is called on a large region (128mb) the process errors, killing the regionserver. On restart, replaying the edits generates the same error, making the region unavailable. Tried to manually unassign, assign or close_region. That didn't work because the code that reads/replays it crashes. From my investigation this seems to be an overflow issue. The logs show that the function getMinimumMidpointArray tried to access index -32743 of an array, extremely close to the minimum short value in Java. Upon investigation of the source code, it seems an index short is used, being incremented as long as the two vectors are the same, probably making it overflow on large vectors with equal data. Changing it to int should solve the problem. 
Here follows the hadoop logs of when the regionserver went down. Any help is appreciated. Any other information you need please do tell me: 2015-03-24 18:00:56,187 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140 2015-03-24 18:00:56,188 INFO [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 2015-03-24 18:04:35,722 INFO [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region memstore size 128.04 MB 2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2. 
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743 at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478) at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263) at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) at
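The overflow suspected in the report above (a {{short}} index wrapping while two long, identical byte sequences are compared) is easy to demonstrate standalone. This is a generic illustration of the arithmetic, not the actual {{getMinimumMidpointArray}} code:

```java
// A short index scanning two long identical byte arrays wraps past
// Short.MAX_VALUE into negative territory; using that value as an array
// index would then throw a negative ArrayIndexOutOfBoundsException, like
// the -32743 seen in the report. An int index has no such problem.
public class ShortIndexOverflow {
    public static void main(String[] args) {
        byte[] a = new byte[40000];
        byte[] b = new byte[40000]; // identical contents (all zeros)
        short i = 0;
        while (i >= 0 && i < a.length && a[i] == b[i]) {
            i++; // 32767 + 1 wraps to -32768
        }
        System.out.println("short index after scan: " + i);
        if (i != Short.MIN_VALUE) throw new AssertionError();
        // With an int index the scan runs to the true end of the arrays.
        int j = 0;
        while (j < a.length && a[j] == b[j]) j++;
        System.out.println("int index stops at: " + j);
        if (j != 40000) throw new AssertionError();
    }
}
```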
[jira] [Commented] (HBASE-13331) Exceptions from DFS client can cause CatalogJanitor to delete referenced files
[ https://issues.apache.org/jira/browse/HBASE-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378826#comment-14378826 ] Benoit Sigoure commented on HBASE-13331: [How Apache Hadoop is molesting IOException all day|http://blog.tsunanet.net/2012/04/apache-hadoop-abuse-ioexception.html] Exceptions from DFS client can cause CatalogJanitor to delete referenced files -- Key: HBASE-13331 URL: https://issues.apache.org/jira/browse/HBASE-13331 Project: HBase Issue Type: Bug Components: master Affects Versions: 1.0.0, 0.98.12 Reporter: Elliott Clark Assignee: Elliott Clark Priority: Blocker Fix For: 2.0.0, 1.0.1, 1.1.0, 0.98.13 Attachments: HBASE-13331.patch CatalogJanitor#checkDaughterInFs assumes that there are no references whenever HRegionFileSystem.openRegionFromFileSystem throws IOException. Well Hadoop and HBase throw IOExceptions whenever someone looks in their general direction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
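The failure mode can be made concrete with a minimal sketch (hypothetical helper names, not the actual CatalogJanitor code): only a definitive FileNotFoundException proves the daughter region directory has no references; any other IOException, such as a transient DFS error, must not be treated as "no references, safe to delete".

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical sketch of the distinction the fix needs to draw: a broad
// IOException from the DFS client is not evidence that a region directory
// holds no references, so it must not trigger file deletion.
public class ReferenceCheck {
  /** Returns true only when the exception definitively proves absence. */
  public static boolean provesNoReferences(IOException e) {
    return e instanceof FileNotFoundException;
  }
}
```

Anything other than the definitive "not found" case should leave the files alone and retry on the next janitor run.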
[jira] [Commented] (HBASE-13331) Exceptions from DFS client can cause CatalogJanitor to delete referenced files
[ https://issues.apache.org/jira/browse/HBASE-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378831#comment-14378831 ] Benoit Sigoure commented on HBASE-13331: Ah, see HBASE-5796 Exceptions from DFS client can cause CatalogJanitor to delete referenced files -- Key: HBASE-13331 URL: https://issues.apache.org/jira/browse/HBASE-13331 Project: HBase Issue Type: Bug Components: master Affects Versions: 1.0.0, 0.98.12 Reporter: Elliott Clark Assignee: Elliott Clark Priority: Blocker Fix For: 2.0.0, 1.0.1, 1.1.0, 0.98.13 Attachments: HBASE-13331.patch CatalogJanitor#checkDaughterInFs assumes that there are no references whenever HRegionFileSystem.openRegionFromFileSystem throws IOException. Well Hadoop and HBase throw IOExceptions whenever someone looks in their general direction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-5539: -- Attachment: 0001-AsyncHBase-PerformanceEvaluation.patch New patch for the latest 0.96. AFAICT this still hasn't been committed. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: Performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Fix For: 0.95.0 Attachments: 0001-AsyncHBase-PerformanceEvaluation.patch, 0001-AsyncHBase-PerformanceEvaluation.patch, 0001-AsyncHBase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt, 5539-asynchbase-PerformanceEvaluation-v3.txt, 5539-asynchbase-PerformanceEvaluation-v4.txt, 5539-asynchbase-PerformanceEvaluation-v5.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also, asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11487) ScanResponse carries non-zero cellblock for CloseScanRequest (ScanRequest with close_scanner = true)
[ https://issues.apache.org/jira/browse/HBASE-11487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14056955#comment-14056955 ] Benoit Sigoure commented on HBASE-11487: Thanks for the patch, Shengzhe. ScanResponse carries non-zero cellblock for CloseScanRequest (ScanRequest with close_scanner = true) Key: HBASE-11487 URL: https://issues.apache.org/jira/browse/HBASE-11487 Project: HBase Issue Type: Bug Components: IPC/RPC, regionserver Affects Versions: 0.96.2, 0.99.0, 2.0.0 Reporter: Shengzhe Yao Assignee: Shengzhe Yao Priority: Minor Fix For: 2.0.0 Attachments: HBase_11487_v1.patch After upgrading HBase from 0.94 to 0.96, we found that our asynchbase client kept throwing errors during normal scans. It turns out these errors are due to the Scanner.close call in asynchbase. Since asynchbase assumes the ScanResponse of a CloseScannerRequest should never carry any cellblocks, it throws an exception if there is a violation. The asynchbase client (1.5.0) constructs a CloseScannerRequest in the following way:
{code}
ScanRequest.newBuilder()
  .setScannerId(scanner_id)
  .setCloseScanner(true)
  .build();
{code}
Note that it does not set numOfRows, which makes sense: why would a close-scanner request care about the number of rows to scan? However, after narrowing down the CloseScannerRequest code path, it seems the issue is on the regionserver side. In RsRpcServices.scan, we always initialize numOfRows to 1, and we do this even for a ScanRequest with close_scanner = true. This causes the response for a CloseScannerRequest to carry a cellBlock (if the scan stops before the end row, which can happen in many normal scenarios). There are two fixes: either always set numOfRows on the asynchbase client side when constructing a CloseScannerRequest, or fix the default value on the server side. From an HBase client's point of view, it makes little sense for the server to send a cellBlock in response to a close-scanner request unless the request explicitly asks for one.
We've made the change in our server code and the asynchbase client errors went away. Beyond this issue, I'd like to know whether we have any specification for the HBase RPC; for example, that if close_scanner = true in a ScanRequest and numOfRows is not set, the ScanResponse is guaranteed to carry no cellBlock. Since we moved to protobuf, and many fields are optional for compatibility reasons, such a specification would help people develop code that depends on the HBase RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
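The server-side fix amounts to a small defaulting rule. A standalone sketch (hypothetical method and class names, not the actual RSRpcServices code): when numOfRows is absent from the request, default the row count to 0 for a pure close request instead of unconditionally defaulting to 1.

```java
// Hypothetical sketch of the defaulting rule described above. The old
// behaviour defaulted the row count to 1 even when the request only closed
// the scanner; defaulting to 0 in that case means no cells are fetched, so
// the ScanResponse carries no cellBlock.
public class ScanDefaults {
  public static int rowsToScan(boolean hasNumOfRows, int numOfRows, boolean closeScanner) {
    if (!hasNumOfRows) {
      return closeScanner ? 0 : 1;
    }
    return numOfRows;
  }
}
```

With this rule, a CloseScannerRequest that leaves numOfRows unset scans zero rows, matching what the asynchbase client assumes.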
[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-5539: -- Attachment: 0001-AsyncHBase-PerformanceEvaluation.patch Updated patch for the 0.96 branch that also builds with JDK 7. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: Performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Fix For: 0.95.0 Attachments: 0001-AsyncHBase-PerformanceEvaluation.patch, 0001-AsyncHBase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt, 5539-asynchbase-PerformanceEvaluation-v3.txt, 5539-asynchbase-PerformanceEvaluation-v4.txt, 5539-asynchbase-PerformanceEvaluation-v5.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also, asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10422) ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase
Benoit Sigoure created HBASE-10422: -- Summary: ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase Key: HBASE-10422 URL: https://issues.apache.org/jira/browse/HBASE-10422 Project: HBase Issue Type: Bug Components: Client, Protobufs Affects Versions: 0.96.1.1, 0.96.1, 0.98.0 Reporter: Benoit Sigoure In HBASE-9868 the trick that AsyncHBase uses to extract byte arrays from protobufs without copying them was ported; however, the signature of {{zeroCopyGetBytes}} was changed for some reason. There are two problems with the changed signature:
# It makes the helper function unusable since it refers to a package-private class.
# It clashes with the signature AsyncHBase expects, thereby making life miserable for users who pull in both AsyncHBase and HBase on their classpath.
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10422) ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase
[ https://issues.apache.org/jira/browse/HBASE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-10422: --- Attachment: 10422-Fix-the-signature-of-zeroCopyGetBytes-to-make-it-us.patch Patch to fix the issue. ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase -- Key: HBASE-10422 URL: https://issues.apache.org/jira/browse/HBASE-10422 Project: HBase Issue Type: Bug Components: Client, Protobufs Affects Versions: 0.98.0, 0.96.1, 0.96.1.1 Reporter: Benoit Sigoure Attachments: 10422-Fix-the-signature-of-zeroCopyGetBytes-to-make-it-us.patch In HBASE-9868 the trick that AsyncHBase uses to extract byte arrays from protobufs without copying them was ported; however, the signature of {{zeroCopyGetBytes}} was changed for some reason. There are two problems with the changed signature:
# It makes the helper function unusable since it refers to a package-private class.
# It clashes with the signature AsyncHBase expects, thereby making life miserable for users who pull in both AsyncHBase and HBase on their classpath.
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10422) ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase
[ https://issues.apache.org/jira/browse/HBASE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-10422: --- Status: Patch Available (was: Open) ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase -- Key: HBASE-10422 URL: https://issues.apache.org/jira/browse/HBASE-10422 Project: HBase Issue Type: Bug Components: Client, Protobufs Affects Versions: 0.96.1.1, 0.96.1, 0.98.0 Reporter: Benoit Sigoure Attachments: 10422-Fix-the-signature-of-zeroCopyGetBytes-to-make-it-us.patch In HBASE-9868 the trick that AsyncHBase uses to extract byte arrays from protobufs without copying them was ported; however, the signature of {{zeroCopyGetBytes}} was changed for some reason. There are two problems with the changed signature:
# It makes the helper function unusable since it refers to a package-private class.
# It clashes with the signature AsyncHBase expects, thereby making life miserable for users who pull in both AsyncHBase and HBase on their classpath.
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10119) Allow HBase coprocessors to clean up when they fail
Benoit Sigoure created HBASE-10119: -- Summary: Allow HBase coprocessors to clean up when they fail Key: HBASE-10119 URL: https://issues.apache.org/jira/browse/HBASE-10119 Project: HBase Issue Type: New Feature Affects Versions: 0.96.0 Reporter: Benoit Sigoure In the thread [Giving a chance to buggy coprocessors to clean up|http://osdir.com/ml/general/2013-12/msg17334.html] I brought up the issue that coprocessors currently don't have a chance to release their own resources (be they internal resources within the JVM, or external resources elsewhere) when they get forcefully removed due to an uncaught exception escaping. It would be nice to fix that, either by adding an API called by the {{CoprocessorHost}} when killing a faulty coprocessor, or by guaranteeing that the coprocessor's {{stop()}} method will be invoked then. This feature request is actually pretty important due to bug HBASE-9046, which means that it's not possible to properly clean up a coprocessor without restarting the RegionServer (!!). -- This message was sent by Atlassian JIRA (v6.1.4#6159)
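The requested guarantee can be sketched in a few lines (hypothetical names; the real hook would live in {{CoprocessorHost}}): before discarding a coprocessor whose uncaught exception forced its removal, invoke its stop() so it can release resources, while shielding the host from a stop() that also throws.

```java
// Hypothetical sketch: give a faulty coprocessor a last chance to clean up
// before it is forcefully removed.
public class FaultyCoprocessorCleanup {
  public interface Stoppable {
    void stop() throws Exception;
  }

  /** Returns true if stop() completed normally, false if it threw. */
  public static boolean stopQuietly(Stoppable coprocessor) {
    try {
      coprocessor.stop();
      return true;
    } catch (Exception e) {
      // A coprocessor buggy enough to be removed may also fail in stop();
      // swallow the exception so the host's removal path is never derailed.
      return false;
    }
  }
}
```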
[jira] [Updated] (HBASE-10119) Allow HBase coprocessors to clean up when they fail
[ https://issues.apache.org/jira/browse/HBASE-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-10119: --- Status: Patch Available (was: Open) Allow HBase coprocessors to clean up when they fail --- Key: HBASE-10119 URL: https://issues.apache.org/jira/browse/HBASE-10119 Project: HBase Issue Type: New Feature Affects Versions: 0.96.0 Reporter: Benoit Sigoure Attachments: HBASE-10119.patch In the thread [Giving a chance to buggy coprocessors to clean up|http://osdir.com/ml/general/2013-12/msg17334.html] I brought up the issue that coprocessors currently don't have a chance to release their own resources (be they internal resources within the JVM, or external resources elsewhere) when they get forcefully removed due to an uncaught exception escaping. It would be nice to fix that, either by adding an API called by the {{CoprocessorHost}} when killing a faulty coprocessor, or by guaranteeing that the coprocessor's {{stop()}} method will be invoked then. This feature request is actually pretty important due to bug HBASE-9046, which means that it's not possible to properly clean up a coprocessor without restarting the RegionServer (!!). -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HBASE-10119) Allow HBase coprocessors to clean up when they fail
[ https://issues.apache.org/jira/browse/HBASE-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-10119: --- Attachment: HBASE-10119.patch Tentative patch to address the issue by making sure we call the coprocessor's {{stop()}} when forcefully removing it. This is the patch I'm using in production right now, it's working well for me. Sorry I didn't have time to write the accompanying test. Allow HBase coprocessors to clean up when they fail --- Key: HBASE-10119 URL: https://issues.apache.org/jira/browse/HBASE-10119 Project: HBase Issue Type: New Feature Affects Versions: 0.96.0 Reporter: Benoit Sigoure Attachments: HBASE-10119.patch In the thread [Giving a chance to buggy coprocessors to clean up|http://osdir.com/ml/general/2013-12/msg17334.html] I brought up the issue that coprocessors currently don't have a chance to release their own resources (be they internal resources within the JVM, or external resources elsewhere) when they get forcefully removed due to an uncaught exception escaping. It would be nice to fix that, either by adding an API called by the {{CoprocessorHost}} when killing a faulty coprocessor, or by guaranteeing that the coprocessor's {{stop()}} method will be invoked then. This feature request is actually pretty important due to bug HBASE-9046, which means that it's not possible to properly clean up a coprocessor without restarting the RegionServer (!!). -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HBASE-9941) The context ClassLoader isn't set while calling into a coprocessor
[ https://issues.apache.org/jira/browse/HBASE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843974#comment-13843974 ] Benoit Sigoure commented on HBASE-9941: --- In any place. For instance in {{prePut}} I could do {{new Foo()}} where the class {{Foo}} has never been used before, and thus only upon entering {{prePut}} would this class get loaded. The context ClassLoader isn't set while calling into a coprocessor -- Key: HBASE-9941 URL: https://issues.apache.org/jira/browse/HBASE-9941 Project: HBase Issue Type: Sub-task Components: Coprocessors Affects Versions: 0.96.0 Reporter: Benoit Sigoure Assignee: Andrew Purtell Fix For: 0.98.0 Whenever one of the methods of a coprocessor is invoked, the context {{ClassLoader}} isn't set to be the {{CoprocessorClassLoader}}. It's only set properly when calling the coprocessor's {{start}} method. This means that if the coprocessor code attempts to load classes using the context {{ClassLoader}}, it will fail to find the classes it's looking for. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
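The fix pattern is the standard context-class-loader swap, sketched here under the assumption that every coprocessor invocation gets wrapped (names are hypothetical, not the actual HBase code): install the coprocessor's loader for the duration of the call and restore the caller's loader in a finally block.

```java
import java.util.function.Supplier;

// Sketch of the wrap-every-invocation pattern: the coprocessor's class
// loader is the context ClassLoader only while its method runs.
public class ContextLoaderSwap {
  public static <T> T callWith(ClassLoader loader, Supplier<T> call) {
    Thread t = Thread.currentThread();
    ClassLoader saved = t.getContextClassLoader();
    t.setContextClassLoader(loader);
    try {
      return call.get();
    } finally {
      t.setContextClassLoader(saved);  // always restore the caller's loader
    }
  }
}
```

With this wrapper, a class like {{Foo}} first referenced inside {{prePut}} is resolved through the coprocessor's own loader rather than the regionserver's.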
[jira] [Updated] (HBASE-9046) Coprocessors can't be upgraded in service reliably
[ https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-9046: -- Summary: Coprocessors can't be upgraded in service reliably (was: Some region servers keep using an older version of coprocessor ) Coprocessors can't be upgraded in service reliably -- Key: HBASE-9046 URL: https://issues.apache.org/jira/browse/HBASE-9046 Project: HBase Issue Type: Sub-task Components: Coprocessors Affects Versions: 0.94.8, 0.96.0 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu Mar 31 21:46:45 PDT 2011 amd64 java version 1.6.0_07 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02) Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) hbase: 0.94.8, r1485407 hadoop: 1.0.4, r1393290 Reporter: iain wright Priority: Minor Fix For: 0.98.0 My team and another user from the mailing list have run into an issue where replacing the coprocessor jar in HDFS and reloading the table does not load the latest jar. It may load the latest version on some percentage of RS but not all of them. This may be a config oversight or a lack of understanding of a caching mechanism that has a purge capability, but I thought I would log it here for confirmation. Workaround is to name the coprocessor JAR uniquely, place in HDFS, and re-enable the table using the new jar's name. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (HBASE-10106) Remove some unnecessary code from TestOpenTableInCoprocessor
Benoit Sigoure created HBASE-10106: -- Summary: Remove some unnecessary code from TestOpenTableInCoprocessor Key: HBASE-10106 URL: https://issues.apache.org/jira/browse/HBASE-10106 Project: HBase Issue Type: Test Affects Versions: 0.96.0 Reporter: Benoit Sigoure Priority: Trivial
{code}
diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
index 7bc2a78..67b97ce 100644
--- a/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
+++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
@@ -69,8 +69,6 @@ public class TestOpenTableInCoprocessor {
     public void prePut(final ObserverContext<RegionCoprocessorEnvironment> e, final Put put,
         final WALEdit edit, final Durability durability) throws IOException {
       HTableInterface table = e.getEnvironment().getTable(otherTable);
-      Put p = new Put(new byte[] { 'a' });
-      p.add(family, null, new byte[] { 'a' });
       table.put(put);
       table.flushCommits();
       completed[0] = true;
{code}
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-10000) Initiate lease recovery for outstanding WAL files at the very beginning of recovery
[ https://issues.apache.org/jira/browse/HBASE-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842463#comment-13842463 ] Benoit Sigoure commented on HBASE-10000: Damn, I can't believe I missed issue #10000. Congrats everyone for filing so many bugs! Initiate lease recovery for outstanding WAL files at the very beginning of recovery --- Key: HBASE-10000 URL: https://issues.apache.org/jira/browse/HBASE-10000 Project: HBase Issue Type: Improvement Reporter: Ted Yu Assignee: Ted Yu Fix For: 0.98.1 Attachments: 10000-recover-ts-with-pb-2.txt, 10000-recover-ts-with-pb-3.txt, 10000-recover-ts-with-pb-4.txt, 10000-recover-ts-with-pb-5.txt, 10000-v4.txt, 10000-v5.txt, 10000-v6.txt At the beginning of recovery, master can send lease recovery requests concurrently for outstanding WAL files using a thread pool. Each split worker would first check whether the WAL file it processes is closed. Thanks to Nicolas Liochon and Jeffery, discussion with whom gave rise to this idea. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HBASE-9046) Some region servers keep using an older version of coprocessor
[ https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-9046: -- Affects Version/s: 0.96.0 Some region servers keep using an older version of coprocessor --- Key: HBASE-9046 URL: https://issues.apache.org/jira/browse/HBASE-9046 Project: HBase Issue Type: Bug Components: Coprocessors Affects Versions: 0.94.8, 0.96.0 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu Mar 31 21:46:45 PDT 2011 amd64 java version 1.6.0_07 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02) Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) hbase: 0.94.8, r1485407 hadoop: 1.0.4, r1393290 Reporter: iain wright Priority: Minor My team and another user from the mailing list have run into an issue where replacing the coprocessor jar in HDFS and reloading the table does not load the latest jar. It may load the latest version on some percentage of RS but not all of them. This may be a config oversight or a lack of understanding of a caching mechanism that has a purge capability, but I thought I would log it here for confirmation. Workaround is to name the coprocessor JAR uniquely, place in HDFS, and re-enable the table using the new jar's name. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-9046) Some region servers keep using an older version of coprocessor
[ https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13818526#comment-13818526 ] Benoit Sigoure commented on HBASE-9046: --- I think the problem is that {{CoprocessorClassLoader.classLoadersCache}} retains the previous class loader in its cache. This is a cache that maps the path of the .jar file to its corresponding {{CoprocessorClassLoader}}. The values in the cache are weak references, but that doesn't guarantee that they will go away in a timely fashion. Therefore, if you edit the schema of your table to unset the coprocessor and re-set it, most of the time you will get the same {{CoprocessorClassLoader}} as before and the new jar won't be loaded. I can reproduce this trivially and consistently on a single-node non-distributed HBase instance. Some region servers keep using an older version of coprocessor --- Key: HBASE-9046 URL: https://issues.apache.org/jira/browse/HBASE-9046 Project: HBase Issue Type: Bug Components: Coprocessors Affects Versions: 0.94.8, 0.96.0 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu Mar 31 21:46:45 PDT 2011 amd64 java version 1.6.0_07 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02) Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) hbase: 0.94.8, r1485407 hadoop: 1.0.4, r1393290 Reporter: iain wright Priority: Minor My team and another user from the mailing list have run into an issue where replacing the coprocessor jar in HDFS and reloading the table does not load the latest jar. It may load the latest version on some percentage of RS but not all of them. This may be a config oversight or a lack of understanding of a caching mechanism that has a purge capability, but I thought I would log it here for confirmation. Workaround is to name the coprocessor JAR uniquely, place in HDFS, and re-enable the table using the new jar's name. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-9046) Some region servers keep using an older version of coprocessor
[ https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13818530#comment-13818530 ] Benoit Sigoure commented on HBASE-9046: --- I can further confirm this because in my current environment I use a single coprocessor, so I devised a workaround for this bug: my coprocessor class has a {{static int}} I use as a reference count: every time my coprocessor's {{start}} is called, I increment it, and in {{stop}} I decrement it. In {{stop}}, when the count drops down to 0, I call {{CoprocessorClassLoader.clearCache()}}. This fixes the problem for me. This trick doesn't work for multiple co-processors, because {{clearCache()}} would clear everything. Also note that {{clearCache()}} is only exposed for testing purposes so it's technically not part of the public API. Another workaround I can think of (but haven't tried) would be to use reflection to access the underlying map and clear out the entry. I think the right way to fix this bug is to maintain the reference count manually by doing the increment/decrement from the {{startup()}} and {{shutdown()}} methods of {{CoprocessorHost$Environment}}. Some region servers keep using an older version of coprocessor --- Key: HBASE-9046 URL: https://issues.apache.org/jira/browse/HBASE-9046 Project: HBase Issue Type: Bug Components: Coprocessors Affects Versions: 0.94.8, 0.96.0 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu Mar 31 21:46:45 PDT 2011 amd64 java version 1.6.0_07 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02) Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) hbase: 0.94.8, r1485407 hadoop: 1.0.4, r1393290 Reporter: iain wright Priority: Minor My team and another user from the mailing list have run into an issue where replacing the coprocessor jar in HDFS and reloading the table does not load the latest jar. It may load the latest version on some percentage of RS but not all of them. 
This may be a config oversight or a lack of understanding of a caching mechanism that has a purge capability, but I thought I would log it here for confirmation. Workaround is to name the coprocessor JAR uniquely, place in HDFS, and re-enable the table using the new jar's name. -- This message was sent by Atlassian JIRA (v6.1#6144)
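The workaround described above amounts to reference counting around start()/stop(). A standalone sketch with hypothetical names (the real workaround calls {{CoprocessorClassLoader.clearCache()}} at the point where the count reaches zero):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the reference-counting workaround: clear the class-loader cache
// only when the last live coprocessor instance stops, so reloads pick up a
// fresh jar instead of the stale cached CoprocessorClassLoader.
public class LoaderRefCount {
  private static final AtomicInteger LIVE = new AtomicInteger();

  public static void onStart() {
    LIVE.incrementAndGet();
  }

  /** Returns true when this stop() was the last one, i.e. the point at
   *  which the cache would be cleared. */
  public static boolean onStop() {
    return LIVE.decrementAndGet() == 0;
  }
}
```

As noted above, this only works with a single coprocessor per cache, since clearing the cache evicts every loader, not just the stale one.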
[jira] [Created] (HBASE-9941) The context ClassLoader isn't set while calling into a coprocessor
Benoit Sigoure created HBASE-9941: - Summary: The context ClassLoader isn't set while calling into a coprocessor Key: HBASE-9941 URL: https://issues.apache.org/jira/browse/HBASE-9941 Project: HBase Issue Type: Bug Components: Coprocessors Affects Versions: 0.96.0 Reporter: Benoit Sigoure Whenever one of the methods of a coprocessor is invoked, the context {{ClassLoader}} isn't set to be the {{CoprocessorClassLoader}}. It's only set properly when calling the coprocessor's {{start}} method. This means that if the coprocessor code attempts to load classes using the context {{ClassLoader}}, it will fail to find the classes it's looking for. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-9936) Table get stuck when it fails to open due to a coprocessor error
Benoit Sigoure created HBASE-9936: - Summary: Table get stuck when it fails to open due to a coprocessor error Key: HBASE-9936 URL: https://issues.apache.org/jira/browse/HBASE-9936 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: Benoit Sigoure I made a mistake after re-enabling a table on which I did an `alter' to add a coprocessor: the .jar I specified wasn't a self-contained jar, and thus some dependent classes couldn't be found.
{code}
2013-11-09 02:39:05,994 INFO [AM.ZK.Worker-pool2-t17] master.RegionStates: Transitioned {8568640c1da6ce0d5e27b656d28fe9fd state=PENDING_OPEN, ts=1383993545988, server=192.168.42.108,59570,1383993435386} to {8568640c1da6ce0d5e27b656d28fe9fd state=OPENING, ts=1383993545994, server=192.168.42.108,59570,1383993435386}
2013-11-09 02:39:05,995 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] coprocessor.CoprocessorHost: Loading coprocessor class com.example.foo.hbase.FooCoprocessor with path /Users/tsuna/src/foo/target/scala-2.10/foo_2.10-0.1.jar and priority 1000
2013-11-09 02:39:06,005 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Finding class: com.example.foo.hbase.FooCoprocessor
2013-11-09 02:39:06,006 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Skipping exempt class org.apache.hadoop.hbase.coprocessor.BaseRegionObserver - delegating directly to parent
2013-11-09 02:39:06,007 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Skipping exempt class java.lang.Object - delegating directly to parent
2013-11-09 02:39:06,007 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Finding class: org.slf4j.LoggerFactory
2013-11-09 02:39:06,007 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Class org.slf4j.LoggerFactory not found - delegating to parent
2013-11-09 02:39:06,008 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Finding class: scala.collection.mutable.StringBuilder
2013-11-09 02:39:06,008 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Class scala.collection.mutable.StringBuilder not found - delegating to parent
2013-11-09 02:39:06,008 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] util.CoprocessorClassLoader: Class scala.collection.mutable.StringBuilder not found in parent loader
2013-11-09 02:39:06,008 ERROR [RS_OPEN_REGION-192.168.42.108:59570-2] handler.OpenRegionHandler: Failed open of region=foo,,1383899959121.8568640c1da6ce0d5e27b656d28fe9fd., starting to roll back the global memstore size.
java.lang.IllegalStateException: Could not instantiate a region instance.
    at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3820)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4078)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4030)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3981)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:475)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:140)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3817)
    ... 9 more
Caused by: java.lang.NoClassDefFoundError: scala/collection/mutable/StringBuilder
    at com.example.foo.hbase.FooCoprocessor.start(FooCoprocessor.scala:18)
    at org.apache.hadoop.hbase.coprocessor.CoprocessorHost$Environment.startup(CoprocessorHost.java:636)
    at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.loadInstance(CoprocessorHost.java:259)
    at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:212)
    at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:192)
    at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.init(RegionCoprocessorHost.java:154)
    at
[jira] [Created] (HBASE-9879) Can't undelete a KeyValue
Benoit Sigoure created HBASE-9879: - Summary: Can't undelete a KeyValue Key: HBASE-9879 URL: https://issues.apache.org/jira/browse/HBASE-9879 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: Benoit Sigoure Test scenario:
put(KV, timestamp=100)
put(KV, timestamp=200)
delete(KV, timestamp=200, with MutationProto.DeleteType.DELETE_ONE_VERSION)
get(KV) = returns value at timestamp=100 (OK)
put(KV, timestamp=200)
get(KV) = returns value at timestamp=100 (but not the one at timestamp=200 that was reborn by the previous put)
Is that normal? I ran into this bug while running the integration tests at https://github.com/OpenTSDB/asynchbase/pull/60 – the first time you run it, it passes, but after that, it keeps failing. Sorry I don't have the corresponding HTable-based code but that should be fairly easy to write. I only tested this with 0.96.0, dunno yet how this behaved in prior releases. My hunch is that the tombstone added by the DELETE_ONE_VERSION keeps shadowing the value even after it's reborn. -- This message was sent by Atlassian JIRA (v6.1#6144)
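The tombstone hunch above can be modeled in a few lines of Python. This is an illustrative toy model of version tombstones, not HBase code; the `ToyStore` name and methods are invented for this sketch, and the assumption being modeled is that a DELETE_ONE_VERSION tombstone keeps masking its timestamp until a major compaction purges it:

```python
# Toy model of the suspected semantics: a version tombstone at ts=200 masks
# ANY cell written at ts=200 -- including one written after the delete --
# until the tombstone is dropped by a major compaction.
class ToyStore:
    def __init__(self):
        self.cells = {}          # timestamp -> value
        self.tombstones = set()  # timestamps masked by DELETE_ONE_VERSION

    def put(self, ts, value):
        self.cells[ts] = value

    def delete_one_version(self, ts):
        self.tombstones.add(ts)

    def get(self):
        """Return the newest value whose timestamp is not tombstoned."""
        visible = [ts for ts in self.cells if ts not in self.tombstones]
        return self.cells[max(visible)] if visible else None

    def major_compact(self):
        # Compaction drops both the masked cells and the tombstones.
        self.cells = {ts: v for ts, v in self.cells.items()
                      if ts not in self.tombstones}
        self.tombstones.clear()

store = ToyStore()
store.put(100, "v1")
store.put(200, "v2")
store.delete_one_version(200)
assert store.get() == "v1"       # OK: ts=200 is masked
store.put(200, "v2-reborn")      # attempt to "undelete"
assert store.get() == "v1"       # still masked: the reported behavior
store.major_compact()
store.put(200, "v2-again")
assert store.get() == "v2-again"  # once the tombstone is gone, the put sticks
```

This would also explain why the integration test passes on the first run and fails afterwards: the first run's tombstone survives until a compaction happens.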
[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807827#comment-13807827 ] Benoit Sigoure commented on HBASE-5539: --- It doesn't look like this was ever committed. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: Performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Fix For: 0.95.0 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt, 5539-asynchbase-PerformanceEvaluation-v3.txt, 5539-asynchbase-PerformanceEvaluation-v4.txt, 5539-asynchbase-PerformanceEvaluation-v5.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-5539: -- Attachment: 0001-AsyncHBase-PerformanceEvaluation.patch Patch for 0.96 of the changes that fell through the cracks. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: Performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Fix For: 0.95.0 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 0001-AsyncHBase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt, 5539-asynchbase-PerformanceEvaluation-v3.txt, 5539-asynchbase-PerformanceEvaluation-v4.txt, 5539-asynchbase-PerformanceEvaluation-v5.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (HBASE-9710) Use the region name, not the encoded name, in exceptions
[ https://issues.apache.org/jira/browse/HBASE-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure reassigned HBASE-9710: - Assignee: Benoit Sigoure Use the region name, not the encoded name, in exceptions Key: HBASE-9710 URL: https://issues.apache.org/jira/browse/HBASE-9710 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.95.2, 0.96.0 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Attachments: 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} we put the encoded region name in the exception, which isn't super useful. I propose putting the region name instead. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-9710) Use the region name, not the encoded name, in exceptions
Benoit Sigoure created HBASE-9710: - Summary: Use the region name, not the encoded name, in exceptions Key: HBASE-9710 URL: https://issues.apache.org/jira/browse/HBASE-9710 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.95.2, 0.96.0 Reporter: Benoit Sigoure Priority: Minor Attachments: 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} we put the encoded region name in the exception, which isn't super useful. I propose putting the region name instead. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HBASE-9710) Use the region name, not the encoded name, in exceptions
[ https://issues.apache.org/jira/browse/HBASE-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-9710: -- Status: Patch Available (was: Open) Use the region name, not the encoded name, in exceptions Key: HBASE-9710 URL: https://issues.apache.org/jira/browse/HBASE-9710 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.95.2, 0.96.0 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Attachments: 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} we put the encoded region name in the exception, which isn't super useful. I propose putting the region name instead. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HBASE-9710) Use the region name, not the encoded name, in exceptions
[ https://issues.apache.org/jira/browse/HBASE-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-9710: -- Attachment: 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch Proposed patch. Use the region name, not the encoded name, in exceptions Key: HBASE-9710 URL: https://issues.apache.org/jira/browse/HBASE-9710 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.95.2, 0.96.0 Reporter: Benoit Sigoure Priority: Minor Attachments: 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} we put the encoded region name in the exception, which isn't super useful. I propose putting the region name instead. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HBASE-9612) Ability to batch edits destined to different regions
[ https://issues.apache.org/jira/browse/HBASE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784832#comment-13784832 ] Benoit Sigoure commented on HBASE-9612: --- Sorry to see the far-reaching consequences this change has. If we were to re-do this from scratch (so assuming that the batch call didn't exist), would you have a multi RPC that does only-edits (instead of a mix of edits and gets) because that would be simpler? I don't have a strong feeling on mixing edits and gets, but I believe being able to batch edits across regions in one RPC call is pretty important. Ability to batch edits destined to different regions Key: HBASE-9612 URL: https://issues.apache.org/jira/browse/HBASE-9612 Project: HBase Issue Type: Bug Affects Versions: 0.95.0, 0.95.1, 0.95.2, 0.96.0 Reporter: Benoit Sigoure Assignee: stack Priority: Critical Fix For: 0.98.0, 0.96.0 Attachments: 0001-fix-packaging-by-region-in-MultiServerCallable.patch, 9612.096.v5.txt, 9612revert.txt, 9612v2.txt, 9612v3.txt, 9612v4.txt, 9612v5.txt, 9612v5.txt, 9612v5.txt, 9612v7.txt, 9612v8.096.txt, 9612v8.txt, 9612v9.txt, 9612v9.txt, 9612.wip.txt The old (pre-PB) multi and multiPut RPCs allowed one to batch edits destined to different regions. Seems like we've lost this ability after the switch to protobufs. The {{MultiRequest}} only contains one {{RegionSpecifier}}, and a list of {{MultiAction}}. The {{MultiAction}} message contains either a single {{MutationProto}} or a {{Get}} (but not both – so its name is misleading as there is nothing multi about it). Also it seems redundant with {{MultiGetRequest}}; I'm not sure what's the point of supporting {{Get}} in {{MultiAction}}. I propose that we change {{MultiRequest}} to be just a list of {{MultiAction}}, and {{MultiAction}} will contain the {{RegionSpecifier}}, the {{bool atomic}} and a list of {{MutationProto}}. This would be a non-backward compatible protobuf change. 
If we want we can support mixing edits and reads, in which case we'd also add a list of {{Get}} in {{MultiAction}}, and we'd support having both that list and the list of {{MutationProto}} set at the same time. But this is a bonus and can be done later (in a backward compatible manner, hence no need to rush on this one). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-9612) Ability to batch edits destined to different regions
Benoit Sigoure created HBASE-9612: - Summary: Ability to batch edits destined to different regions Key: HBASE-9612 URL: https://issues.apache.org/jira/browse/HBASE-9612 Project: HBase Issue Type: Bug Affects Versions: 0.95.2, 0.95.1, 0.95.0, 0.96.0 Reporter: Benoit Sigoure The old (pre-PB) multi and multiPut RPCs allowed one to batch edits destined to different regions. Seems like we've lost this ability after the switch to protobufs. The {{MultiRequest}} only contains one {{RegionSpecifier}}, and a list of {{MultiAction}}. The {{MultiAction}} message contains either a single {{MutationProto}} or a {{Get}} (but not both – so its name is misleading as there is nothing multi about it). Also it seems redundant with {{MultiGetRequest}}; I'm not sure what's the point of supporting {{Get}} in {{MultiAction}}. I propose that we change {{MultiRequest}} to be just a list of {{MultiAction}}, and {{MultiAction}} will contain the {{RegionSpecifier}}, the {{bool atomic}} and a list of {{MutationProto}}. This would be a non-backward compatible protobuf change. If we want we can support mixing edits and reads, in which case we'd also add a list of {{Get}} in {{MultiAction}}, and we'd support having both that list and the list of {{MutationProto}} set at the same time. But this is a bonus and can be done later (in a backward compatible manner, hence no need to rush on this one). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
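The proposed shape — a {{MultiRequest}} that is just a list of per-region {{MultiAction}}s — can be sketched with plain data structures. This is an illustrative Python sketch, not the actual protobuf messages; `MultiAction` and `build_multi_request` here are stand-ins mirroring the proposal:

```python
# Sketch of the proposal: group mutations destined to different regions into
# one request, one MultiAction (region + atomic flag + mutations) per region.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class MultiAction:              # stand-in for the proposed protobuf message
    region: str                 # stands in for RegionSpecifier
    atomic: bool = False        # stands in for bool atomic
    mutations: list = field(default_factory=list)  # stands in for MutationProto list

def build_multi_request(edits, atomic=False):
    """edits: iterable of (region, mutation) pairs -> list of MultiAction."""
    by_region = defaultdict(list)
    for region, mutation in edits:
        by_region[region].append(mutation)
    return [MultiAction(region, atomic, muts)
            for region, muts in sorted(by_region.items())]

request = build_multi_request([
    ("region-A", "put r1"),
    ("region-B", "put r2"),
    ("region-A", "delete r3"),
])
assert [a.region for a in request] == ["region-A", "region-B"]
assert request[0].mutations == ["put r1", "delete r3"]
```

The point of the grouping is that one RPC can carry edits for every region a server hosts, as the pre-PB multi/multiPut calls did.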
[jira] [Commented] (HBASE-8958) Sometimes we refer to the single .META. table region as .META.,,1 and other times as .META.,,1.1028785192
[ https://issues.apache.org/jira/browse/HBASE-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719349#comment-13719349 ] Benoit Sigoure commented on HBASE-8958: --- I noticed that if I refer to META as {{.META.,,1.1028785192}} instead of {{.META.,,1}} it doesn't work (the RS sends back an NSRE). Sometimes we refer to the single .META. table region as .META.,,1 and other times as .META.,,1.1028785192 -- Key: HBASE-8958 URL: https://issues.apache.org/jira/browse/HBASE-8958 Project: HBase Issue Type: Bug Reporter: stack Fix For: 0.95.2 See here how we say in a log: {code} 2013-07-15 22:32:53,805 INFO [main] regionserver.HRegion(4176): Open {ENCODED = 1028785192, NAME = '.META.,,1', STARTKEY = '', ENDKEY = ''} {code} but when we open other regions we do: {code} 764 2013-07-15 22:40:10,867 INFO [RS_OPEN_REGION-durruti:61987-0] regionserver.HRegion: Open {ENCODED = 93dad2bbf6ff5ea0d7477f504b303346, NAME = 'x,,1373953210791.93dad2bbf6ff5ea0d7477f504b303346.', ... {code} Note how in the second, the name includes the encoded name. We'll also do : {code} 2013-07-15 22:32:53,810 INFO [main] regionserver.HRegion(629): Onlined 1028785192/.META.; next sequenceid=1 {code} vs {code} 785 2013-07-15 22:40:10,885 INFO [AM.ZK.Worker-pool-2-thread-7] master.RegionStates: Onlined 93dad2bbf6ff5ea0d7477f504b303346 on durruti,61987,1373947581222 {code} ... where we print the encoded name. Master web UI shows .META.,,1.1028785192 Benoit originally noticed this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-9006) RPC code requires cellBlockCodecClass even though one is defined by default
Benoit Sigoure created HBASE-9006: - Summary: RPC code requires cellBlockCodecClass even though one is defined by default Key: HBASE-9006 URL: https://issues.apache.org/jira/browse/HBASE-9006 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.95.1 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor The protobuf definition provides a default value:
{code}
// This is sent on connection setup after the connection preamble is sent.
message ConnectionHeader {
  [...]
  optional string cellBlockCodecClass = 3 [default = org.apache.hadoop.hbase.codec.KeyValueCodec];
  // Compressor we will use if cell block is compressed. Server will throw exception if not supported.
  // Class must implement hadoop's CompressionCodec Interface
  [...]
}
{code}
Yet if one doesn't explicitly set a value, the code was rejecting the connection. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-9006) RPC code requires cellBlockCodecClass even though one is defined by default
[ https://issues.apache.org/jira/browse/HBASE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-9006: -- Attachment: 0001-Don-t-require-a-cellBlockCodecClass-in-the-RPC-heade.patch Patch that fixes the issue. RPC code requires cellBlockCodecClass even though one is defined by default --- Key: HBASE-9006 URL: https://issues.apache.org/jira/browse/HBASE-9006 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.95.1 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Attachments: 0001-Don-t-require-a-cellBlockCodecClass-in-the-RPC-heade.patch The protobuf definition provides a default value: {code} // This is sent on connection setup after the connection preamble is sent. message ConnectionHeader { [...] optional string cellBlockCodecClass = 3 [default = org.apache.hadoop.hbase.codec.KeyValueCodec]; // Compressor we will use if cell block is compressed. Server will throw exception if not supported. // Class must implement hadoop's CompressionCodec Interface [...] } {code} Yet if one doesn't explicitly set a value, the code was rejecting the connection. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-9006) RPC code requires cellBlockCodecClass even though one is defined by default
[ https://issues.apache.org/jira/browse/HBASE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-9006: -- Status: Patch Available (was: Open) RPC code requires cellBlockCodecClass even though one is defined by default --- Key: HBASE-9006 URL: https://issues.apache.org/jira/browse/HBASE-9006 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.95.1 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Attachments: 0001-Don-t-require-a-cellBlockCodecClass-in-the-RPC-heade.patch The protobuf definition provides a default value: {code} // This is sent on connection setup after the connection preamble is sent. message ConnectionHeader { [...] optional string cellBlockCodecClass = 3 [default = org.apache.hadoop.hbase.codec.KeyValueCodec]; // Compressor we will use if cell block is compressed. Server will throw exception if not supported. // Class must implement hadoop's CompressionCodec Interface [...] } {code} Yet if one doesn't explicitly set a value, the code was rejecting the connection. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
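The gist of the fix can be sketched outside protobuf: when an optional field declares a default, the reader should fall back to that default when the field is unset, rather than rejecting the connection. This is a hedged Python model of that logic, not the actual HBase RPC code; the dict stands in for a parsed {{ConnectionHeader}}:

```python
# Model of the intended fallback: absent cellBlockCodecClass -> use the
# default declared in the .proto, never reject the connection for it.
DEFAULT_CODEC = "org.apache.hadoop.hbase.codec.KeyValueCodec"

def codec_class(header: dict) -> str:
    # header models the parsed ConnectionHeader; unset field -> default
    return header.get("cellBlockCodecClass") or DEFAULT_CODEC

def accept_connection(header: dict) -> bool:
    # The buggy behavior rejected a header without an explicit codec;
    # with the fallback there is always a usable codec class.
    return codec_class(header) is not None

assert codec_class({}) == DEFAULT_CODEC
assert accept_connection({}) is True
assert codec_class({"cellBlockCodecClass": "MyCodec"}) == "MyCodec"
```

In proto2 semantics the generated getter already returns the declared default for an unset optional field, which is why requiring the client to set it explicitly is redundant.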
[jira] [Commented] (HBASE-9001) TestThriftServerCmdLine.testRunThriftServer[0] failed
[ https://issues.apache.org/jira/browse/HBASE-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714369#comment-13714369 ] Benoit Sigoure commented on HBASE-9001: --- IT'S OVER 9000!!! TestThriftServerCmdLine.testRunThriftServer[0] failed - Key: HBASE-9001 URL: https://issues.apache.org/jira/browse/HBASE-9001 Project: HBase Issue Type: Bug Components: test Reporter: stack Fix For: 0.95.2 Attachments: 9001.txt https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/624/testReport/junit/org.apache.hadoop.hbase.thrift/TestThriftServerCmdLine/testRunThriftServer_0_/ It seems stuck here: {code} 2013-07-19 03:52:03,158 INFO [Thread-131] thrift.TestThriftServerCmdLine(132): Starting HBase Thrift server with command line: -hsha -port 56708 start 2013-07-19 03:52:03,174 INFO [ThriftServer-cmdline] thrift.ThriftServerRunner$ImplType(208): Using thrift server type hsha 2013-07-19 03:52:03,205 WARN [ThriftServer-cmdline] conf.Configuration(817): fs.default.name is deprecated. Instead, use fs.defaultFS 2013-07-19 03:52:03,206 WARN [ThriftServer-cmdline] conf.Configuration(817): mapreduce.job.counters.limit is deprecated. Instead, use mapreduce.job.counters.max 2013-07-19 03:52:03,207 WARN [ThriftServer-cmdline] conf.Configuration(817): io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum 2013-07-19 03:54:03,156 INFO [pool-1-thread-1] hbase.ResourceChecker(171): after: thrift.TestThriftServerCmdLine#testRunThriftServer[0] Thread=146 (was 155), OpenFileDescriptor=295 (was 311), MaxFileDescriptor=4096 (was 4096), SystemLoadAverage=293 (was 240) - SystemLoadAverage LEAK? -, ProcessCount=145 (was 143) - ProcessCount LEAK? 
-, AvailableMemoryMB=779 (was 1263), ConnectionCount=4 (was 4) 2013-07-19 03:54:03,157 DEBUG [pool-1-thread-1] thrift.TestThriftServerCmdLine(107): implType=-hsha, specifyFramed=false, specifyBindIP=false, specifyCompact=true {code} My guess is that we didn't get scheduled because load was almost 300 on this box at the time? Let me up the timeout of two minutes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever
[ https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-8952: -- Attachment: HBASE-8952.patch Patch to fix the issue in the 0.95 branch. Missing error handling can cause RegionServer RPC thread to busy loop forever - Key: HBASE-8952 URL: https://issues.apache.org/jira/browse/HBASE-8952 Project: HBase Issue Type: Bug Components: IPC/RPC Reporter: Benoit Sigoure Assignee: Benoit Sigoure Attachments: HBASE-8952.patch This bug seems to be present in all released versions of HBase, including the tip of the 0.94 and 0.95 branches. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever
Benoit Sigoure created HBASE-8952: - Summary: Missing error handling can cause RegionServer RPC thread to busy loop forever Key: HBASE-8952 URL: https://issues.apache.org/jira/browse/HBASE-8952 Project: HBase Issue Type: Bug Components: IPC/RPC Reporter: Benoit Sigoure Assignee: Benoit Sigoure Attachments: HBASE-8952.patch This bug seems to be present in all released versions of HBase, including the tip of the 0.94 and 0.95 branches. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever
[ https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708874#comment-13708874 ] Benoit Sigoure commented on HBASE-8952: --- I attached a patch that fixes the issue for me in 0.95 – the patch would need to be ported to other branches as well as to the secure RPC implementation, which is separate in 0.94 ({{ipc/SecureServer.java}}). Missing error handling can cause RegionServer RPC thread to busy loop forever - Key: HBASE-8952 URL: https://issues.apache.org/jira/browse/HBASE-8952 Project: HBase Issue Type: Bug Components: IPC/RPC Reporter: Benoit Sigoure Assignee: Benoit Sigoure Attachments: HBASE-8952.patch This bug seems to be present in all released versions of HBase, including the tip of the 0.94 and 0.95 branches. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever
[ https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-8952: -- Status: Patch Available (was: Open) Missing error handling can cause RegionServer RPC thread to busy loop forever - Key: HBASE-8952 URL: https://issues.apache.org/jira/browse/HBASE-8952 Project: HBase Issue Type: Bug Components: IPC/RPC Reporter: Benoit Sigoure Assignee: Benoit Sigoure Attachments: HBASE-8952.patch This bug seems to be present in all released versions of HBase, including the tip of the 0.94 and 0.95 branches. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever
[ https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-8952: -- Description: If the connection to the client is closed unexpectedly and at the wrong time, the code will attempt to keep reading from the socket in a busy loop. This bug seems to be present in all released versions of HBase, including the tip of the 0.94 and 0.95 branches, however I only ran into it while porting AsyncHBase to 0.95 was:This bug seems to be present in all released versions of HBase, including the tip of the 0.94 and 0.95 branches. Missing error handling can cause RegionServer RPC thread to busy loop forever - Key: HBASE-8952 URL: https://issues.apache.org/jira/browse/HBASE-8952 Project: HBase Issue Type: Bug Components: IPC/RPC Reporter: Benoit Sigoure Assignee: Benoit Sigoure Attachments: HBASE-8952.patch If the connection to the client is closed unexpectedly and at the wrong time, the code will attempt to keep reading from the socket in a busy loop. This bug seems to be present in all released versions of HBase, including the tip of the 0.94 and 0.95 branches, however I only ran into it while porting AsyncHBase to 0.95 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
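The failure mode described in the updated description — a connection closed unexpectedly, then read in a loop with no end-of-stream handling — can be reproduced in miniature. This is an illustrative Python sketch, not the HBase RPC server code; `drain` is an invented helper showing the missing check:

```python
# Minimal model of the bug: once the peer closes, recv() returns b""
# immediately instead of blocking, so a loop that never checks for EOF
# spins forever. The `if not data` guard is the missing error handling.
import socket

def drain(conn: socket.socket, max_iterations: int = 1000) -> int:
    """Read until EOF; return the number of recv() calls made."""
    calls = 0
    while calls < max_iterations:
        calls += 1
        data = conn.recv(4096)
        if not data:        # b"" means the peer closed the connection;
            break           # without this check we'd busy-loop on recv()
    return calls

a, b = socket.socketpair()
a.sendall(b"some rpc bytes")
a.close()                   # client goes away unexpectedly
calls = drain(b)
b.close()
assert calls <= 3           # with the EOF check the loop terminates quickly
```

The same principle applies to non-blocking reads (a read returning -1/0 in Java NIO): end-of-stream must be treated as "connection gone", not retried.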
[jira] [Commented] (HBASE-6290) Add a function a mark a server as dead and start the recovery the process
[ https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641541#comment-13641541 ] Benoit Sigoure commented on HBASE-6290: --- Yes, that's right. It would be great to have this sort of kill switch, both at the HBase level and at the HDFS level. The feature I presented works especially well to tell all interested parties (clients) that the node they're trying to reach is dead, but often it doesn't help time the node out of the cluster, e.g. in HDFS or MapReduce, the NameNode and JobTracker will ignore TCP resets and will not flag the node as really dead until some long pre-configured timeout elapses. Add a function a mark a server as dead and start the recovery the process - Key: HBASE-6290 URL: https://issues.apache.org/jira/browse/HBASE-6290 Project: HBase Issue Type: Improvement Components: monitoring Affects Versions: 0.95.2 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Minor Labels: noob ZooKeeper is used as a monitoring tool: we use znodes and we start the recovery process when a znode is deleted by ZK because it got a timeout. This timeout is defaulted to 90 seconds, and often set to 30s. However, some HW issues could be detected by specialized hw monitoring tools before the ZK timeout. For this reason, it makes sense to offer a very simple function to mark a RS as dead. This should not take in It could be a hbase shell function such as considerAsDead ipAddress|serverName This would delete all the znodes of the server running on this box, starting the recovery process. Such a function would be easily callable (at the caller's risk) by any fault detection tool... We could have issues identifying the right master region servers around ipv4 vs ipv6 and multi-networked boxes, however. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6504) Adding GC details prevents HBase from starting in non-distributed mode
[ https://issues.apache.org/jira/browse/HBASE-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456124#comment-13456124 ] Benoit Sigoure commented on HBASE-6504: --- Yeah. One minor nit though: the form {{head -1}} is deprecated (and has been for years). Better to use {{head -n 1}}. Adding GC details prevents HBase from starting in non-distributed mode -- Key: HBASE-6504 URL: https://issues.apache.org/jira/browse/HBASE-6504 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Benoit Sigoure Assignee: Michael Drzal Priority: Trivial Labels: noob Attachments: HBASE-6504-output.txt, HBASE-6504.patch The {{conf/hbase-env.sh}} that ships with HBase contains a few commented out examples of variables that could be useful, such as adding {{-XX:+PrintGCDetails -XX:+PrintGCDateStamps}} to {{HBASE_OPTS}}. This has the annoying side effect that the JVM prints a summary of memory usage when it exits, and it does so on stdout: {code} $ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed false Heap par new generation total 19136K, used 4908K [0x00073a20, 0x00073b6c, 0x00075186) eden space 17024K, 28% used [0x00073a20, 0x00073a6cb0a8, 0x00073b2a) from space 2112K, 0% used [0x00073b2a, 0x00073b2a, 0x00073b4b) to space 2112K, 0% used [0x00073b4b, 0x00073b4b, 0x00073b6c) concurrent mark-sweep generation total 63872K, used 0K [0x00075186, 0x0007556c, 0x0007f5a0) concurrent-mark-sweep perm gen total 21248K, used 6994K [0x0007f5a0, 0x0007f6ec, 0x0008) $ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed /dev/null (nothing printed) {code} And this confuses {{bin/start-hbase.sh}} when it does {{distMode=`$bin/hbase --config $HBASE_CONF_DIR org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed`}}, because then the {{distMode}} variable is not just set to {{false}}, it also contains all this JVM spam. 
If you don't pay enough attention and realize that 3 processes are getting started (ZK, HM, RS) instead of just one (HM), then you end up with this confusing error message: {{Could not start ZK at requested port of 2181. ZK was started at port: 2182. Aborting as clients (e.g. shell) will not be able to find this ZK quorum.}}, which is even more puzzling because when you run {{netstat}} to see who owns that port, then you won't find any rogue process other than the one you just started. I'm wondering if the fix is not to just change the {{if [ $distMode == 'false' ]}} to a {{switch $distMode case (false*)}} type of test, to work around this annoying JVM misfeature that pollutes stdout. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
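The comparison problem in {{bin/start-hbase.sh}} can be illustrated outside the shell. This is a Python model of the two comparisons; the actual fix in the script would use the prefix-matching {{case $distMode in false*)}} form suggested above:

```python
# Model of the start-hbase.sh bug: the JVM appends GC-summary spam to
# stdout, so an exact compare of the captured output against "false"
# fails, while a prefix match still recognizes the value.
clean_output = "false"
spammed_output = "false\nHeap par new generation total 19136K, used 4908K"

def dist_mode_is_false_exact(out: str) -> bool:
    return out == "false"           # breaks as soon as anything else is printed

def dist_mode_is_false_prefix(out: str) -> bool:
    return out.startswith("false")  # tolerant of trailing JVM spam

assert dist_mode_is_false_exact(clean_output)
assert not dist_mode_is_false_exact(spammed_output)   # the observed failure
assert dist_mode_is_false_prefix(spammed_output)      # the proposed fix
```

An alternative, also hinted at in the report, is to keep only the first line of the command's output before comparing, which isolates the real value from anything the JVM prints afterwards.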
[jira] [Commented] (HBASE-6586) Quarantine Corrupted HFiles
[ https://issues.apache.org/jira/browse/HBASE-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440930#comment-13440930 ] Benoit Sigoure commented on HBASE-6586: --- Not sure why {{HBaseIOException}} would be added in a JIRA about Quarantine Corrupted HFiles, but yes that would be nice to have. Quarantine Corrupted HFiles --- Key: HBASE-6586 URL: https://issues.apache.org/jira/browse/HBASE-6586 Project: HBase Issue Type: Improvement Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Attachments: 0001-hbase-6568-hbck-quarantine-v6.patch, hbase-6586-92-v3.patch, hbase-6586-92-v8.patch, hbase-6586-94-v3.patch, hbase-6586-94-v8.patch, hbase-6586.patch, hbase-6586-trunk-v3.patch, hbase-6586-trunk-v4.patch, hbase-6586-trunk-v5.patch, hbase-6586-trunk-v6.patch, hbase-6586-trunk-v7.patch, hbase-6586-trunk-v8.patch We've encountered a few upgrades from 0.90 hbases + 20.2/1.x hdfs to 0.92 hbases + hdfs 2.x that get stuck. I haven't been able to duplicate the problem in my dev environment but we suspect this may be related to HDFS-3731. On the HBase side, it seems reasonable to quarantine what are most likely truncated hfiles, so that they could later be recovered. Here's an example of the exception we've encountered: {code} 2012-07-18 05:55:01,152 ERROR handler.OpenRegionHandler (OpenRegionHandler.java:openRegion(346)) - Failed open of region=user_mappings,080112102AA76EF98197605D341B9E6C5824D2BC|1001,1317824890618.eaed0e7abc6d27d28ff0e5a9b49c4c0d. 
java.io.IOException: java.lang.IllegalArgumentException: Invalid HFile version: 842220600 (expected to be between 1 and 2)
    at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:306)
    at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:371)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:387)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.init(StoreFile.java:1026)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:485)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:566)
    at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:286)
    at org.apache.hadoop.hbase.regionserver.Store.init(Store.java:223)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2534)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:454)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3282)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3230)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:331)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:107)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.IllegalArgumentException: Invalid HFile version: 842220600 (expected to be between 1 and 2)
    at org.apache.hadoop.hbase.io.hfile.HFile.checkFormatVersion(HFile.java:515)
    at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:303)
    ... 17 more
{code}
Specifically -- the FixedFileTrailer is incorrect, and seemingly missing. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
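The quarantine idea above boils down to: a truncated hfile yields a garbage trailer version, so instead of failing the region open, move the file aside. A minimal sketch of that logic, assuming hypothetical names (the real hbck logic landed in HFileCorruptionChecker via the attached patches):

```java
import java.io.IOException;
import java.nio.file.*;

class QuarantineSketch {
    // Bounds taken from the exception message above ("expected to be between 1 and 2").
    static final int MIN_VERSION = 1, MAX_VERSION = 2;

    // A truncated file typically yields a garbage value such as 842220600.
    static boolean isPlausibleVersion(int trailerVersion) {
        return trailerVersion >= MIN_VERSION && trailerVersion <= MAX_VERSION;
    }

    // Move a suspect hfile aside so the region can still open (hypothetical paths).
    static Path quarantine(Path hfile, Path quarantineDir) throws IOException {
        Files.createDirectories(quarantineDir);
        return Files.move(hfile, quarantineDir.resolve(hfile.getFileName().toString()));
    }
}
```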
[jira] [Created] (HBASE-6504) Adding GC details prevents HBase from starting in non-distributed mode
Benoit Sigoure created HBASE-6504: - Summary: Adding GC details prevents HBase from starting in non-distributed mode Key: HBASE-6504 URL: https://issues.apache.org/jira/browse/HBASE-6504 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Benoit Sigoure Priority: Trivial The {{conf/hbase-env.sh}} that ships with HBase contains a few commented out examples of variables that could be useful, such as adding {{-XX:+PrintGCDetails -XX:+PrintGCDateStamps}} to {{HBASE_OPTS}}. This has the annoying side effect that the JVM prints a summary of memory usage when it exits, and it does so on stdout: {code} $ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed false Heap par new generation total 19136K, used 4908K [0x00073a20, 0x00073b6c, 0x00075186) eden space 17024K, 28% used [0x00073a20, 0x00073a6cb0a8, 0x00073b2a) from space 2112K, 0% used [0x00073b2a, 0x00073b2a, 0x00073b4b) to space 2112K, 0% used [0x00073b4b, 0x00073b4b, 0x00073b6c) concurrent mark-sweep generation total 63872K, used 0K [0x00075186, 0x0007556c, 0x0007f5a0) concurrent-mark-sweep perm gen total 21248K, used 6994K [0x0007f5a0, 0x0007f6ec, 0x0008) $ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed /dev/null (nothing printed) {code} And this confuses {{bin/start-hbase.sh}} when it does {{distMode=`$bin/hbase --config $HBASE_CONF_DIR org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed`}}, because then the {{distMode}} variable is not just set to {{false}}, it also contains all this JVM spam. If you don't pay enough attention and realize that 3 processes are getting started (ZK, HM, RS) instead of just one (HM), then you end up with this confusing error message: {{Could not start ZK at requested port of 2181. ZK was started at port: 2182. Aborting as clients (e.g. 
shell) will not be able to find this ZK quorum.}}, which is even more puzzling because when you run {{netstat}} to see who owns that port, then you won't find any rogue process other than the one you just started. I'm wondering if the fix is not to just change the {{if [ $distMode == 'false' ]}} to a {{switch $distMode case (false*)}} type of test, to work around this annoying JVM misfeature that pollutes stdout.
[jira] [Updated] (HBASE-2877) Unnecessary byte written when serializing a Writable RPC parameter
[ https://issues.apache.org/jira/browse/HBASE-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-2877: -- Affects Version/s: 0.90.0 0.90.1 0.90.2 0.90.3 0.90.4 0.90.5 0.90.6 0.92.0 0.92.1 0.94.0 Unnecessary byte written when serializing a Writable RPC parameter -- Key: HBASE-2877 URL: https://issues.apache.org/jira/browse/HBASE-2877 Project: HBase Issue Type: Bug Components: ipc Affects Versions: 0.20.5, 0.89.20100621, 0.90.0, 0.90.1, 0.90.2, 0.90.3, 0.90.4, 0.90.5, 0.90.6, 0.92.0, 0.92.1, 0.94.0 Reporter: Benoit Sigoure Priority: Minor When {{HbaseObjectWritable#writeObject}} serializes a {{Writable}} RPC parameter, it writes its class code twice to the wire. {{writeClassCode}} is already called once unconditionally at the beginning of the method, and for {{Writable}} arguments, it's called a second time towards the end of the method. It seems that the code is trying to deal with the declared type vs. actual type of a parameter. The Hadoop RPC code was already doing this before Stack changed it to use codes in r608738 for HADOOP-2519. It's not documented when this is useful though, and I couldn't find any use case. Every RPC I've seen so far just ends up with the same byte sent twice to the wire.
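The double write can be modeled with a toy serializer (this is an illustrative sketch, not the real {{HbaseObjectWritable}} code): the class code is written once unconditionally, then a second, redundant time for {{Writable}} arguments, costing one wasted byte per parameter.

```java
import java.io.*;

class DoubleCodeSketch {
    static byte[] serialize(byte classCode, byte[] payload, boolean writeCodeTwice)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(classCode);      // unconditional write at the start of writeObject
        if (writeCodeTwice) {
            out.writeByte(classCode);  // the redundant second write for Writable arguments
        }
        out.write(payload);            // the actual serialized Writable
        return buf.toByteArray();
    }
}
```

Every call with {{writeCodeTwice}} set produces exactly one extra byte on the wire, which is the waste described in the issue.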
[jira] [Commented] (HBASE-6239) [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row
[ https://issues.apache.org/jira/browse/HBASE-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412564#comment-13412564 ] Benoit Sigoure commented on HBASE-6239: --- This means HBase replication will still corrupt timestamps in 0.90.7, which in many cases makes replication useless. Are you sure? [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row --- Key: HBASE-6239 URL: https://issues.apache.org/jira/browse/HBASE-6239 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Critical Labels: corruption Fix For: 0.92.2, 0.90.8 Attachments: HBASE-6239-0.92-v1.patch ReplicationSink assumes that all the KVs for the same row inside a WALEdit will have the same timestamp, which is not necessarily the case. This only affects 0.90 and 0.92 since HBASE-5203 fixes it in 0.94
[jira] [Commented] (HBASE-6239) [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row
[ https://issues.apache.org/jira/browse/HBASE-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399417#comment-13399417 ] Benoit Sigoure commented on HBASE-6239: --- I would argue that this bug is not minor, because we're talking about data being corrupted by HBase. [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row --- Key: HBASE-6239 URL: https://issues.apache.org/jira/browse/HBASE-6239 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Minor Labels: corruption Fix For: 0.92.2 Attachments: HBASE-6239-0.92-v1.patch ReplicationSink assumes that all the KVs for the same row inside a WALEdit will have the same timestamp, which is not necessarily the case. This only affects 0.90 and 0.92 since HBASE-5203 fixes it in 0.94
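The corruption pattern described above can be shown with a minimal model, assuming a simplified KV type (this is not the real ReplicationSink code): the buggy sink takes the first KV's timestamp and stamps it onto every KV for that row, silently rewriting the others.

```java
import java.util.*;

class SinkTsSketch {
    // Simplified stand-in for a KeyValue: row key, timestamp, value.
    record KV(byte[] row, long ts, byte[] value) {}

    // Models the bug: one timestamp (from the first KV) applied to all KVs in the edit.
    static List<KV> applyFirstTs(List<KV> walEditKvs) {
        long firstTs = walEditKvs.get(0).ts();
        List<KV> out = new ArrayList<>();
        for (KV kv : walEditKvs) {
            out.add(new KV(kv.row(), firstTs, kv.value()));  // ts of later KVs is lost
        }
        return out;
    }
}
```

A WALEdit whose KVs carry timestamps 100 and 200 comes out of this sink with both at 100, which is exactly the timestamp corruption the label refers to.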
[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398880#comment-13398880 ] Benoit Sigoure commented on HBASE-5539: --- +1, thanks. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready.
[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398982#comment-13398982 ] Benoit Sigoure commented on HBASE-5539: --- Yeah someone first needs to commit HBASE-5539. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt, 5539-asynchbase-PerformanceEvaluation-v3.txt, 5539-asynchbase-PerformanceEvaluation-v4.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready.
[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399011#comment-13399011 ] Benoit Sigoure commented on HBASE-5539: --- Err... I meant this issue depends on HBASE-5527 to be committed first. That's why you were missing the method {{getNumClientThreads()}} earlier. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt, 5539-asynchbase-PerformanceEvaluation-v3.txt, 5539-asynchbase-PerformanceEvaluation-v4.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready.
[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399112#comment-13399112 ] Benoit Sigoure commented on HBASE-5539: --- Thanks for the commit. In the future please keep the changes in separate commits if they are in separate issues. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch, 5539-asynchbase-PerformanceEvaluation-v2.txt, 5539-asynchbase-PerformanceEvaluation-v3.txt, 5539-asynchbase-PerformanceEvaluation-v4.txt, 5539-asynchbase-PerformanceEvaluation-v5.txt I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready.
[jira] [Commented] (HBASE-5527) PerformanceEvaluation: Report aggregate timings on a single line
[ https://issues.apache.org/jira/browse/HBASE-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399113#comment-13399113 ] Benoit Sigoure commented on HBASE-5527: --- Using {{nanoTime}} isn't only nice, it's also more correct :) {{currentTimeMillis}} depends on system time and is not monotonic, whereas {{nanoTime}} is almost always implemented with a proper monotonic clock (although I think technically this isn't _guaranteed_, but in practice on all reasonable platforms, it's the case). PerformanceEvaluation: Report aggregate timings on a single line Key: HBASE-5527 URL: https://issues.apache.org/jira/browse/HBASE-5527 Project: HBase Issue Type: Improvement Components: performance Affects Versions: 0.92.0 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Attachments: 0001-PerformanceEvaluation-fixes.patch, 0001-PerformanceEvaluation-fixes.patch When running {{PerformanceEvaluation}} with {{--nomapred}} it's hard to locate all the lines saying {{Finished 14 in 292979ms writing 200 rows}} in the output. This change adds a couple of lines to summarize the run at the end, which makes parsing and scripting the output easier: {code} 12/03/06 00:43:58 INFO hbase.PerformanceEvaluation: [RandomWriteTest] Summary of timings (ms): [15940, 15776, 15866, 15973, 15682, 15740, 15764, 15830, 15768, 15968, 15921, 15755, 15963, 15818, 15903, 15662] 12/03/06 00:43:58 INFO hbase.PerformanceEvaluation: [RandomWriteTest] Min: 15662ms Max: 15973ms Avg: 15833ms {code} Patch also removes a couple minor code smells.
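The two points above (a one-line min/max/avg summary, and measuring elapsed time with the monotonic {{System.nanoTime}} rather than the wall-clock {{System.currentTimeMillis}}) can be sketched as follows; the names are illustrative, not the actual patch code:

```java
class TimingSummary {
    // One-line summary in the spirit of the patch's "Min: ...ms Max: ...ms Avg: ...ms" report.
    static String summarize(long[] timingsMs) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
        for (long t : timingsMs) {
            min = Math.min(min, t);
            max = Math.max(max, t);
            sum += t;
        }
        return String.format("Min: %dms Max: %dms Avg: %dms", min, max, sum / timingsMs.length);
    }

    // Monotonic elapsed-time measurement: nanoTime is unaffected by wall-clock adjustments,
    // so a step change in system time during the run cannot produce negative durations.
    static long elapsedMs(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```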
[jira] [Updated] (HBASE-6239) [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row
[ https://issues.apache.org/jira/browse/HBASE-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-6239: -- Labels: corruption (was: ) [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row --- Key: HBASE-6239 URL: https://issues.apache.org/jira/browse/HBASE-6239 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Minor Labels: corruption Fix For: 0.92.2 Attachments: HBASE-6239-0.92-v1.patch ReplicationSink assumes that all the KVs for the same row inside a WALEdit will have the same timestamp, which is not necessarily the case. This only affects 0.90 and 0.92 since HBASE-5203 fixes it in 0.94
[jira] [Updated] (HBASE-3170) RegionServer confused about empty row keys
[ https://issues.apache.org/jira/browse/HBASE-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-3170: -- Affects Version/s: 0.90.0 0.90.1 0.90.2 0.90.3 0.90.4 0.90.5 0.90.6 0.92.0 0.92.1 RegionServer confused about empty row keys -- Key: HBASE-3170 URL: https://issues.apache.org/jira/browse/HBASE-3170 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.89.20100621, 0.89.20100924, 0.90.0, 0.90.1, 0.90.2, 0.90.3, 0.90.4, 0.90.5, 0.90.6, 0.92.0, 0.92.1 Reporter: Benoit Sigoure I'm no longer sure about the expected behavior when using an empty row key (e.g. a 0-byte long byte array). I assumed that this was a legitimate row key, just like having an empty column qualifier is allowed. But it seems that the RegionServer considers the empty row key to be whatever the first row key is. {code} Version: 0.89.20100830, r0da2890b242584a8a5648d83532742ca7243346b, Sat Sep 18 15:30:09 PDT 2010 hbase(main):001:0> scan 'tsdb-uid', {LIMIT => 1} ROW COLUMN+CELL \x00 column=id:metrics, timestamp=1288375187699, value=foo \x00 column=id:tagk, timestamp=1287522021046, value=bar \x00 column=id:tagv, timestamp=1288111387685, value=qux 1 row(s) in 0.4610 seconds hbase(main):002:0> get 'tsdb-uid', '' COLUMN CELL id:metrics timestamp=1288375187699, value=foo id:tagk timestamp=1287522021046, value=bar id:tagv timestamp=1288111387685, value=qux 3 row(s) in 0.0910 seconds hbase(main):003:0> get 'tsdb-uid', \000 COLUMN CELL id:metrics timestamp=1288375187699, value=foo id:tagk timestamp=1287522021046, value=bar id:tagv timestamp=1288111387685, value=qux 3 row(s) in 0.0550 seconds {code} This isn't a parsing problem with the command-line of the shell. I can reproduce this behavior both with plain Java code and with my asynchbase client. Since I don't actually have a row with an empty row key, I expected that the first {{get}} would return nothing.
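The semantics the reporter expected can be illustrated without HBase at all: a lexicographically ordered map keyed by raw bytes, where the empty key is a real, distinct key rather than an alias for whatever row happens to sort first. This is a sketch of the expectation, not of any HBase internals:

```java
import java.util.*;

class EmptyKeySketch {
    // A one-row "table" keyed by raw bytes, ordered lexicographically
    // (Arrays.compare gives the same byte-wise ordering HBase row keys use).
    static TreeMap<byte[], String> singleRowTable() {
        TreeMap<byte[], String> rows = new TreeMap<>(Arrays::compare);
        rows.put(new byte[] {0x00}, "value-of-row-x00");  // the only row that exists
        return rows;
    }
}
```

Under these semantics a lookup on the 0-byte key finds nothing, while a lookup on {{\x00}} finds the row, which is the behavior the first {{get}} in the transcript was expected to show.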
[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-5539: -- Attachment: 0001-asynchbase-PerformanceEvaluation.patch Updated patch with new {{pom.xml}}. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready.
[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-5539: -- Status: Patch Available (was: Open) asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 0001-asynchbase-PerformanceEvaluation.patch I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready.
[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13275887#comment-13275887 ] Benoit Sigoure commented on HBASE-5539: --- Yeah I have preliminary results at http://goo.gl/mZAcK – it shows that asynchbase can be quite a bit faster than {{HTable}}, surprisingly perhaps especially for read-heavy workloads, as well as for write-heavy workloads with many threads, where {{HTable}} suffers from really poor concurrency. BTW, asynchbase 1.3.0 has been released, so the patch I attached originally to this issue needs to be updated to change the dependency to be on that version. I'll post a new patch soon, unless someone beats me to it. asynchbase PerformanceEvaluation Key: HBASE-5539 URL: https://issues.apache.org/jira/browse/HBASE-5539 Project: HBase Issue Type: New Feature Components: performance Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Labels: benchmark Attachments: 0001-asynchbase-PerformanceEvaluation.patch I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into {{PerformanceEvaluation}}. This enables testing asynchbase from {{PerformanceEvaluation}} and comparing its performance to {{HTable}}. Also asynchbase doesn't come with any benchmark, so it was good that I was able to plug it into {{PerformanceEvaluation}} relatively easily. I am in the process of collecting results on a dev cluster running 0.92.1 and will publish them once they're ready.
[jira] [Commented] (HBASE-2489) Make the Filesystem needs to be upgraded error message more useful.
[ https://issues.apache.org/jira/browse/HBASE-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259035#comment-13259035 ] Benoit Sigoure commented on HBASE-2489: --- @chenning, this isn't a good place to ask for support, send an email to the user mailing list. At this point you should be using a version of HBase that's much newer than the one in which this fix was made, so you shouldn't need to apply this fix. Make the Filesystem needs to be upgraded error message more useful. - Key: HBASE-2489 URL: https://issues.apache.org/jira/browse/HBASE-2489 Project: HBase Issue Type: Improvement Components: util Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Trivial Fix For: 0.90.0 Attachments: 0001-Improve-the-error-message-File-system-needs-to-be-up.patch The other day, when starting HBase I got this error: {noformat} 2010-04-23 09:38:14,847 ERROR org.apache.hadoop.hbase.master.HMaster: Failed to start master org.apache.hadoop.hbase.util.FileSystemVersionException: File system needs to be upgraded. Run the '${HBASE_HOME}/bin/hbase migrate' script. at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:187) {noformat} I was puzzled until I realized, after adding extra debug statements in the code, that I forgot to properly set {{hbase.rootdir}} after re-deploying my dev environment. I think the message above was misleading and I'm proposing a trivial patch to make it a little bit better: {noformat} 2010-04-23 09:48:29,000 ERROR org.apache.hadoop.hbase.master.HMaster: Failed to start master org.apache.hadoop.hbase.util.FileSystemVersionException: File system needs to be upgraded. You have version null and I want version 7. Run the '${HBASE_HOME}/bin/hbase migrate' script. at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:189) {noformat}
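The improvement is just interpolating the actual and expected versions into the message; note how a missing version file naturally surfaces as "version null", which is what pointed at the {{hbase.rootdir}} misconfiguration. A minimal sketch (illustrative only, not the {{FSUtils}} code):

```java
class VersionMessageSketch {
    static final int WANTED_VERSION = 7;  // the on-disk format version this build expects

    // Include both the version found on disk (null when hbase.rootdir points at
    // an empty or wrong directory) and the version the code wants.
    static String upgradeMessage(String fileVersion) {
        return "File system needs to be upgraded. You have version " + fileVersion
            + " and I want version " + WANTED_VERSION
            + ". Run the '${HBASE_HOME}/bin/hbase migrate' script.";
    }
}
```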
[jira] [Commented] (HBASE-3581) hbase rpc should send size of response
[ https://issues.apache.org/jira/browse/HBASE-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13105894#comment-13105894 ] Benoit Sigoure commented on HBASE-3581: --- +1, thank you Stack. hbase rpc should send size of response -- Key: HBASE-3581 URL: https://issues.apache.org/jira/browse/HBASE-3581 Project: HBase Issue Type: Improvement Reporter: ryan rawson Assignee: stack Priority: Critical Fix For: 0.92.0 Attachments: 3581-v2.txt, HBASE-rpc-response.txt The RPC reply from Server-Client does not include the size of the payload, it is framed like so:
{noformat}
i32 callId
byte errorFlag
byte[] data
{noformat}
The data segment would contain enough info about how big the response is so that it could be decoded by a writable reader. This makes it difficult to write buffering clients, who might read the entire 'data' then pass it to a decoder. While less memory efficient, if you want to easily write block read clients (eg: nio) it would be necessary to send the size along so that the client could snarf into a local buf. The new proposal is:
{noformat}
i32 callId
i32 size
byte errorFlag
byte[] data
{noformat}
the size being sizeof(data) + sizeof(errorFlag).
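The proposed framing can be sketched with a toy encoder (this models the layout described above, not the actual HBase IPC code): the size field counts the payload plus the one-byte error flag, so a client can read the fixed 8-byte header and then buffer exactly {{size}} more bytes.

```java
import java.io.*;

class FramedReply {
    // Encodes: i32 callId, i32 size, byte errorFlag, byte[] data,
    // where size = data.length + 1 (the error flag byte).
    static byte[] encode(int callId, boolean error, byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(callId);
        out.writeInt(data.length + 1);   // lets an nio client snarf the rest into a local buf
        out.writeByte(error ? 1 : 0);
        out.write(data);
        return buf.toByteArray();
    }
}
```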
[jira] [Created] (HBASE-4236) Don't lock the stream while serializing the response
Don't lock the stream while serializing the response Key: HBASE-4236 URL: https://issues.apache.org/jira/browse/HBASE-4236 Project: HBase Issue Type: Improvement Components: ipc Affects Versions: 0.90.4 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor It is not necessary to hold the lock on the stream while the response is being serialized. This unnecessarily prevents serializing responses in parallel.
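The change described above amounts to serializing into a private buffer first and holding the stream lock only for the actual write; a minimal sketch under assumed names (not the real HBase IPC classes):

```java
import java.io.*;

class UnlockedSerialize {
    private final Object streamLock = new Object();
    private final OutputStream stream;

    UnlockedSerialize(OutputStream stream) {
        this.stream = stream;
    }

    void sendResponse(byte[] responseFields) throws IOException {
        // Serialization happens outside the lock, so multiple handler threads
        // can serialize their responses in parallel.
        byte[] wire = serialize(responseFields);
        synchronized (streamLock) {
            stream.write(wire);  // the lock only covers the shared stream write
        }
    }

    // Stand-in for real Writable serialization.
    private static byte[] serialize(byte[] fields) {
        return fields.clone();
    }
}
```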
[jira] [Created] (HBASE-4237) Directly remove the call being handled from the map of outstanding RPCs
Directly remove the call being handled from the map of outstanding RPCs --- Key: HBASE-4237 URL: https://issues.apache.org/jira/browse/HBASE-4237 Project: HBase Issue Type: Improvement Components: ipc Affects Versions: 0.90.4 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor The client has to maintain a map of RPC ID to `Call' object for this RPC, for every outstanding RPC. When receiving a response, the client was getting the `Call' out of the map (one O(log n) operation) and then removing it from the map (another O(log n) operation). There is no benefit in not removing it directly from the map.
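The improvement is essentially one line: {{Map.remove(key)}} already returns the previous value, so the {{get()}}-then-{{remove()}} pair collapses into a single lookup. A sketch with hypothetical names (a skip-list map is used here to mirror the O(log n) cost mentioned above):

```java
import java.util.concurrent.*;

class CallMapSketch {
    // Stand-in for the client's per-RPC bookkeeping object.
    record Call(int id) {}

    static final ConcurrentSkipListMap<Integer, Call> calls = new ConcurrentSkipListMap<>();

    // remove() both fetches and deletes the entry: one O(log n) traversal instead of two.
    static Call take(int rpcId) {
        return calls.remove(rpcId);
    }
}
```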
[jira] [Commented] (HBASE-4237) Directly remove the call being handled from the map of outstanding RPCs
[ https://issues.apache.org/jira/browse/HBASE-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088299#comment-13088299 ] Benoit Sigoure commented on HBASE-4237: --- Patch @ https://github.com/tsuna/hbase/commit/1f602391ee4cd3d11eaf3067208caeadf214b3a8 Directly remove the call being handled from the map of outstanding RPCs --- Key: HBASE-4237 URL: https://issues.apache.org/jira/browse/HBASE-4237 Project: HBase Issue Type: Improvement Components: ipc Affects Versions: 0.90.4 Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor The client has to maintain a map of RPC ID to `Call' object for this RPC, for every outstanding RPC. When receiving a response, the client was getting the `Call' out of the map (one O(log n) operation) and then removing it from the map (another O(log n) operation). There is no benefit in not removing it directly from the map.
[jira] [Updated] (HBASE-2321) Support RPC interface changes at runtime
[ https://issues.apache.org/jira/browse/HBASE-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-2321: -- Hadoop Flags: [Incompatible change, Reviewed] (was: [Reviewed]) This breaks RPC compatibility. Support RPC interface changes at runtime Key: HBASE-2321 URL: https://issues.apache.org/jira/browse/HBASE-2321 Project: HBase Issue Type: Improvement Components: coprocessors Reporter: Andrew Purtell Assignee: Gary Helmling Fix For: 0.92.0 Now we are able to append methods to interfaces without breaking RPC compatibility with earlier releases. However there is no way that I am aware of to dynamically add entire new RPC interfaces. Methods/parameters are fixed to the class used to instantiate the server at that time. Coprocessors need this. They will extend functionality on regions in arbitrary ways. How to support that on the client side? A couple of options: 1. New RPC from scratch. 2. Modify HBaseServer such that multiple interface objects can be used for reflection and objects can be added or removed at runtime. 3. Have the coprocessor host instantiate new HBaseServer instances on ephemeral ports and publish the endpoints to clients via Zookeeper. Couple this with a small modification to HBaseServer to support elastic thread pools to minimize the number of threads that might be kept around in the JVM. 4. Add a generic method to HRegionInterface, an ioctl-like construction, which accepts a ImmutableBytesWritable key and an array of Writable as parameters. My opinion is we should opt for #4 as it is the simplest and most expedient approach. I could also do #3 if consensus prefers. Really we should do #1 but it's not clear who has the time for that at the moment.
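Option #4 above is a single generic entry point keyed by endpoint name, so new coprocessor RPCs need no interface change. A minimal sketch of the dispatch idea, with hypothetical names and plain byte arrays standing in for {{ImmutableBytesWritable}}/{{Writable}}:

```java
import java.util.*;
import java.util.function.*;

class GenericRpcSketch {
    private final Map<String, Function<byte[][], byte[]>> endpoints = new HashMap<>();

    // Coprocessors register handlers at runtime under a name.
    void register(String name, Function<byte[][], byte[]> handler) {
        endpoints.put(name, handler);
    }

    // The one fixed "ioctl-like" method carries every dynamically added endpoint.
    byte[] execEndpoint(String name, byte[][] args) {
        Function<byte[][], byte[]> h = endpoints.get(name);
        if (h == null) throw new IllegalArgumentException("unknown endpoint: " + name);
        return h.apply(args);
    }
}
```

The tradeoff, as the discussion implies, is that the server interface never changes but all typing moves into the payload, which is also why the update flags this as an incompatible wire change.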
[jira] [Created] (HBASE-3973) HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY
HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY -- Key: HBASE-3973 URL: https://issues.apache.org/jira/browse/HBASE-3973 Project: HBase Issue Type: Improvement Components: shell Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor In the HBase shell, when the output isn't a TTY, the shell assumes the terminal to be 100 characters wide. The way the shell wraps things around makes it very hard to script the output of the shell (e.g. redirect the output to a file and then work on that file, or pipe the output to another command). When stdout isn't a TTY, the shell shouldn't try to wrap things around.
[jira] [Updated] (HBASE-3973) HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY
[ https://issues.apache.org/jira/browse/HBASE-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-3973: -- Status: Patch Available (was: Open) HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY -- Key: HBASE-3973 URL: https://issues.apache.org/jira/browse/HBASE-3973 Project: HBase Issue Type: Improvement Components: shell Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Attachments: hbase-hirb-formatter.patch In the HBase shell, when the output isn't a TTY, the shell assumes the terminal to be 100 characters wide. The way the shell wraps things around makes it very hard to script the output of the shell (e.g. redirect the output to a file and then work on that file, or pipe the output to another command). When stdout isn't a TTY, the shell shouldn't try to wrap things around.
[jira] [Updated] (HBASE-3973) HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY
[ https://issues.apache.org/jira/browse/HBASE-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoit Sigoure updated HBASE-3973: -- Attachment: hbase-hirb-formatter.patch Patch to fix the issue. HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY -- Key: HBASE-3973 URL: https://issues.apache.org/jira/browse/HBASE-3973 Project: HBase Issue Type: Improvement Components: shell Reporter: Benoit Sigoure Assignee: Benoit Sigoure Priority: Minor Attachments: hbase-hirb-formatter.patch In the HBase shell, when the output isn't a TTY, the shell assumes the terminal to be 100 characters wide. The way the shell wraps things around makes it very hard to script the output of the shell (e.g. redirect the output to a file and then work on that file, or pipe the output to another command). When stdout isn't a TTY, the shell shouldn't try to wrap things around.
[jira] [Created] (HBASE-3859) Increment a counter when a Scanner lease expires
Increment a counter when a Scanner lease expires -- Key: HBASE-3859 URL: https://issues.apache.org/jira/browse/HBASE-3859 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 0.90.2 Reporter: Benoit Sigoure Priority: Minor
Whenever a Scanner lease expires, the RegionServer closes it automatically and logs a message to complain. I would like the RegionServer to increment a counter whenever this happens and expose it through the metrics system, so we can plug it into our monitoring system (OpenTSDB) and keep track of how frequently this happens. It's not supposed to happen often, so it's worth keeping an eye on.
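The request boils down to an atomic counter bumped from the lease-expiry path and exported through the metrics system. A minimal Python sketch under assumed names ({{Counter}}, {{expired_scanner_leases}}, and {{on_scanner_lease_expired}} are all made up for illustration; this is not HBase's metrics API):

```python
import threading


class Counter:
    """Monotonic counter that is safe to bump from the lease-expiry thread
    while a metrics reporter reads it from another thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1

    def get(self):
        with self._lock:
            return self._value


# Metric the RegionServer would expose, e.g. for OpenTSDB to collect.
expired_scanner_leases = Counter()


def on_scanner_lease_expired(scanner_id):
    # Called from the lease-expiry path: close the scanner (elided here),
    # then record the event in the exported metric.
    expired_scanner_leases.increment()
```

In the JVM the same effect comes from an atomic long behind the metrics registry; the point is simply that expiry events become a time series instead of scattered log lines.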
[jira] [Created] (HBASE-3850) Log more details when a scanner lease expires
Log more details when a scanner lease expires -- Key: HBASE-3850 URL: https://issues.apache.org/jira/browse/HBASE-3850 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Benoit Sigoure Priority: Minor
The message the RegionServer logs when a Scanner lease expires isn't as useful as it could be: {{Scanner 4765412385779771089 lease expired}}. Most clients don't log their scanner ID, so it's really hard to figure out what was going on. It would be useful to at least log the name of the region the Scanner was open on, and it would be great to also have the ip:port of the client that held the lease.
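Concretely, the improved message would carry the region name and the client's ip:port alongside the scanner ID. A sketch of the desired format (the function name and exact wording are illustrative, not the actual HBase log line):

```python
def scanner_expired_message(scanner_id, region_name, client_addr):
    """Build a lease-expiry log line that an operator can correlate:
    the scanner ID, the region it was scanning, and who held the lease."""
    return ("Scanner %d lease expired on region %s, held by client %s"
            % (scanner_id, region_name, client_addr))
```

With the region name present, the expiry can be matched against slow scans or splits on that region even when the client never logged its scanner ID.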
[jira] [Commented] (HBASE-3732) New configuration option for client-side compression
[ https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021675#comment-13021675 ] Benoit Sigoure commented on HBASE-3732: --- Sounds good, Stack.
New configuration option for client-side compression -- Key: HBASE-3732 URL: https://issues.apache.org/jira/browse/HBASE-3732 Project: HBase Issue Type: New Feature Reporter: Jean-Daniel Cryans Fix For: 0.92.0
We have a case here where we have to store very fat cells (arrays of integers) that can amount to hundreds of KBs, which we need to read often, concurrently, and possibly keep in cache. Compressing the values on the client using java.util.zip's Deflater before sending them to HBase proved, in our case, almost an order of magnitude faster. The reasons are evident: less data sent to HBase, the memstore contains compressed data, the block cache contains compressed data too, etc. I was thinking it might be useful to add this to a family schema, so that Put/Result do the conversion for you. The actual compression algorithm should also be configurable.
[jira] [Commented] (HBASE-3732) New configuration option for client-side compression
[ https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016021#comment-13016021 ] Benoit Sigoure commented on HBASE-3732: --- Oh yeah, I forgot that this was in the {{info:regioninfo}} column, my bad. Wouldn't it be awesome if this was actually on a key-per-key basis? Is there a spare bit in {{KeyValue}} we can steal to indicate that a KV is compressed? We could compress not only the value, but also the column qualifier and/or the key if they're big too (some applications store data in the column qualifier or, less frequently, in the key).
New configuration option for client-side compression -- Key: HBASE-3732 URL: https://issues.apache.org/jira/browse/HBASE-3732 Project: HBase Issue Type: New Feature Reporter: Jean-Daniel Cryans Fix For: 0.92.0
We have a case here where we have to store very fat cells (arrays of integers) that can amount to hundreds of KBs, which we need to read often, concurrently, and possibly keep in cache. Compressing the values on the client using java.util.zip's Deflater before sending them to HBase proved, in our case, almost an order of magnitude faster. The reasons are evident: less data sent to HBase, the memstore contains compressed data, the block cache contains compressed data too, etc. I was thinking it might be useful to add this to a family schema, so that Put/Result do the conversion for you. The actual compression algorithm should also be configurable.
[jira] [Commented] (HBASE-3732) New configuration option for client-side compression
[ https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015753#comment-13015753 ] Benoit Sigoure commented on HBASE-3732: --- If you want {{Put}}/{{Result}} to do the conversion for you, that means the client needs to be aware of the schema of the table before it can start using it, right? Because right now HBase clients don't know the schema, so it's something extra they'd need to look up separately, unless we add new fields in the {{.META.}} table that go along with each and every region.
New configuration option for client-side compression -- Key: HBASE-3732 URL: https://issues.apache.org/jira/browse/HBASE-3732 Project: HBase Issue Type: New Feature Reporter: Jean-Daniel Cryans Fix For: 0.92.0
We have a case here where we have to store very fat cells (arrays of integers) that can amount to hundreds of KBs, which we need to read often, concurrently, and possibly keep in cache. Compressing the values on the client using java.util.zip's Deflater before sending them to HBase proved, in our case, almost an order of magnitude faster. The reasons are evident: less data sent to HBase, the memstore contains compressed data, the block cache contains compressed data too, etc. I was thinking it might be useful to add this to a family schema, so that Put/Result do the conversion for you. The actual compression algorithm should also be configurable.
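The scheme discussed in this thread (J-D used java.util.zip's Deflater) can be sketched in Python with the equivalent zlib codec. The one-byte flag marking a value as compressed is my own addition, loosely echoing the "spare bit in {{KeyValue}}" idea above; neither the flag bytes nor the {{min_size}} threshold come from HBase:

```python
import zlib

# Hypothetical one-byte marker prepended to each stored value so the
# reader knows whether to inflate it.
FLAG_PLAIN, FLAG_DEFLATE = b"\x00", b"\x01"


def encode_value(value, min_size=64):
    """Deflate a cell value on the client before the Put, but only when
    it is big enough for compression to pay off and actually shrinks."""
    if len(value) >= min_size:
        packed = zlib.compress(value)
        if len(packed) < len(value):
            return FLAG_DEFLATE + packed
    return FLAG_PLAIN + value


def decode_value(stored):
    """Undo encode_value on the Result side of the round trip."""
    flag, payload = stored[:1], stored[1:]
    return zlib.decompress(payload) if flag == FLAG_DEFLATE else payload
```

Because the stored bytes are compressed end to end, the memstore, WAL, and block cache all hold the smaller representation, which is exactly where the speedup in the ticket comes from.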
[jira] Commented: (HBASE-2170) hbase lightweight client library as a distribution
[ https://issues.apache.org/jira/browse/HBASE-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008872#comment-13008872 ] Benoit Sigoure commented on HBASE-2170: ---
bq. This is an impressive number. Just curious if u were able to run the same benchmark with WAL turned on, and what numbers you see then..
Curiously enough, I see the same numbers. This is the first import I did on Thursday (no WAL):
{code}
$ ./src/tsdb import /tmp/data.gz
[...]
2011-03-17 18:45:51,797 INFO [main] TextImporter: ... 1000000 data points in 6688ms (149521.5 points/s)
2011-03-17 18:45:56,836 INFO [main] TextImporter: ... 2000000 data points in 5044ms (198255.4 points/s)
2011-03-17 18:46:01,823 INFO [main] TextImporter: ... 3000000 data points in 4986ms (200561.6 points/s)
2011-03-17 18:46:06,848 INFO [main] TextImporter: ... 4000000 data points in 5025ms (199005.0 points/s)
2011-03-17 18:46:11,865 INFO [main] TextImporter: ... 5000000 data points in 5016ms (199362.0 points/s)
2011-03-17 18:46:14,315 INFO [main] TextImporter: Processed /tmp/data.gz in 29211 ms, 5487065 data points (187842.4 points/s)
2011-03-17 18:46:14,315 INFO [main] TextImporter: Total: imported 5487065 data points in 29.212s (187838.4 points/s)
{code}
Note: 1 data point = 1 {{KeyValue}}. I commented out {{dp.setBatchImport(true);}} in [TextImporter.getDataPoints|https://github.com/stumbleupon/opentsdb/blob/master/src/tools/TextImporter.java#L225] and ran the same import again. Note: this isn't exactly an apples-to-apples comparison, because I'm going to overwrite existing {{KeyValue}}s instead of creating new ones. The table has {{VERSIONS=1}}, but I think we disabled major compactions, so we don't delete old data (Stack/JD, correct me if I'm mistaken about our setup).
{code}
$ ./src/tsdb import /tmp/data.gz
[...]
2011-03-19 19:09:36,102 INFO [main] TextImporter: ... 1000000 data points in 6699ms (149276.0 points/s)
2011-03-19 19:09:41,101 INFO [main] TextImporter: ... 2000000 data points in 5004ms (199840.1 points/s)
2011-03-19 19:09:46,051 INFO [main] TextImporter: ... 3000000 data points in 4949ms (202061.0 points/s)
2011-03-19 19:09:51,006 INFO [main] TextImporter: ... 4000000 data points in 4955ms (201816.3 points/s)
2011-03-19 19:09:56,017 INFO [main] TextImporter: ... 5000000 data points in 5010ms (199600.8 points/s)
2011-03-19 19:09:58,422 INFO [main] TextImporter: Processed /tmp/data.gz in 29025 ms, 5487065 data points (189046.2 points/s)
2011-03-19 19:09:58,422 INFO [main] TextImporter: Total: imported 5487065 data points in 29.026s (189041.3 points/s)
{code}
So... this totally surprises me. I expected to see a big performance drop with the WAL enabled. I wondered whether I hadn't properly recompiled the code, or whether something else was still disabling the WAL, but I verified with {{strace}} that the WAL was turned on in the RPC that was going out:
{code}
$ strace -f -e trace=write -s 4096 ./src/tsdb import /tmp/data.gz
[...]
[pid 21364] write(32, \0\0\312\313\0\0\0\3\0\10multiPut\0\0\0\00199\0\0\0\1Btsdb,\0\3\371L\301[\360\0\0\7\0\2;,1300586854474.a2a283a471dfcf5dcda82d05f2d468ed.\0\0\0:\1\r\0\3\371MZ2\200\0\0\7\0\0\216\177\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\1\0\0\0\1\1t\0\0\0(\0\0\6\340\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\0{\177\377\377\377\377\377\377\377\4\0\0\0\0C\0\0\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\1k\177\377\377\377\377\377\377\377\4\0\0\0\0Cd...
{code}
This shows that the WAL is enabled. Having the source of [asynchbase's {{MultiPutRequest}}|https://github.com/stumbleupon/asynchbase/blob/master/src/MultiPutRequest.java#L274] greatly helps make sense of this otherwise impossible-to-understand blob:
* We can easily see where the region name is: it contains an MD5 sum followed by a period ({{.}}).
* After the region name, the next 4 bytes are the number of edits for this region: {{\0\0\0:}} = 58.
* Then there's a byte with value 1 for the versioning of the {{Put}} object: {{\1}}.
* Then there's the row key of the row we're writing to: {{\r\0\3\371MZ2\200\0\0\7\0\0\216}}, where:
** {{\r}} is a {{vint}} indicating that the key length is 13 bytes.
** The first 3 bytes of the row key in OpenTSDB correspond to the metric ID: {{\0\3\371}}.
** The next 4 bytes in OpenTSDB correspond to a UNIX timestamp: {{MZ2\200}} (big-endian). Using Python, it's easy to confirm that:
{code}
>>> import struct
>>> import time
>>> struct.unpack(">I", "MZ2\200")
(1297756800,)
>>> time.ctime(*_)
'Tue Feb 15 00:00:00 2011'
{code}
** The next 6 bytes in OpenTSDB correspond to a tag:
*** 3 bytes for a tag name ID: {{\0\0\7}}
*** 3 bytes for a tag value ID: {{\0\0\216}}
* Then we have the timestamp of the edit, which is unset, so it's {{Long.MAX_VALUE}}, which is {{\177\377\377\377\377\377\377\377}}.
* Then we have the {{RowLock}} ID. In this case no row lock is involved, so the value is {{-1L}}:
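The row-key layout walked through above (3-byte metric ID, 4-byte big-endian UNIX timestamp, then 3+3-byte tag name/value ID pairs) can be checked mechanically. A small Python helper, written here for illustration and not part of OpenTSDB:

```python
import struct


def decode_tsdb_row_key(key):
    """Split an OpenTSDB row key into (metric_id, timestamp, tags):
    3 bytes of metric ID, a 4-byte big-endian UNIX timestamp, then
    6 bytes per tag (3-byte tag-name ID + 3-byte tag-value ID)."""
    if len(key) < 7 or (len(key) - 7) % 6 != 0:
        raise ValueError("not a well-formed tsdb row key: %r" % (key,))
    metric_id = int.from_bytes(key[:3], "big")
    (timestamp,) = struct.unpack(">I", key[3:7])
    tags = []
    for off in range(7, len(key), 6):
        name_id = int.from_bytes(key[off:off + 3], "big")
        value_id = int.from_bytes(key[off + 3:off + 6], "big")
        tags.append((name_id, value_id))
    return metric_id, timestamp, tags
```

Feeding it the 13-byte key from the blob, {{\0\3\371MZ2\200\0\0\7\0\0\216}}, yields metric 0x0003F9, timestamp 1297756800 (Tue Feb 15 00:00:00 2011 UTC), and the single tag pair (7, 0x8E), matching the byte-by-byte reading above.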
[jira] Commented: (HBASE-3671) Split report before we finish parent region open; workaround till 0.92; Race between split and OPENED processing
[ https://issues.apache.org/jira/browse/HBASE-3671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008574#comment-13008574 ] Benoit Sigoure commented on HBASE-3671: --- +1 too, thanks for the quick turnaround, guys!
Split report before we finish parent region open; workaround till 0.92; Race between split and OPENED processing -- Key: HBASE-3671 URL: https://issues.apache.org/jira/browse/HBASE-3671 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: stack Attachments: 3671.txt
This issue is about adding a workaround to 0.90 until we get a proper fix in 0.92 (HBASE-3559). Here is the sequence of events:
1. We start to process the OPENED region event.
2. We receive a SPLIT report for this region.
3. SPLIT processing offlines the region and onlines the daughters.
4. The meta scanner runs and clears the region out of .META., deleting it.
5. The OPENED handler runs and marks the region online in Master memory.
6. The Balancer runs and tries to balance a region that has been deleted, looping forever.
Here is an excerpt from the logs. It happened during startup, with lots going on. It could happen on a regionserver crash too, I suppose, but we're most susceptible during cluster start:
{code}
# We assign the region
2011-03-16 15:18:29,053 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x22e286f0b9c98f1 Async create of unassigned node for 3516b74d0c9d4458c2f2f715249e3f78 with OFFLINE state
...
2011-03-16 15:18:32,298 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78. state=OFFLINE, ts=1300313909053, server=sv4borg39,60020,1300313564807
...
2011-03-16 15:18:32,732 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78. state=OFFLINE, ts=1300313909053
...
2011-03-16 15:23:02,114 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:6-0x22e286f0b9c98f1 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/prodjobs/unassigned/3516b74d0c9d4458c2f2f715249e3f78
...
2011-03-16 15:23:02,183 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x22e286f0b9c98f1 Retrieved 127 byte(s) of data from znode /prodjobs/unassigned/3516b74d0c9d4458c2f2f715249e3f78 and set watcher; region=tsdb,^@^D2McZ@^@^@^A^@^@G^@^@^L^@^@f^@^@^U^@^@�^@^@(^@^C^G,1299401073466.3516b74d0c9d4458c2f2f715249e3f78., server=sv4borg39,60020,1300313564807, state=RS_ZK_REGION_OPENED
2011-03-16 15:23:02,183 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=sv4borg39,60020,1300313564807, region=3516b74d0c9d4458c2f2f715249e3f78
# At this point we've queued an Executor to run to process the OPENED event. Now in comes the SPLIT.
2011-03-16 15:23:18,199 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.: Daughters; tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1300314189812.74c51400bb8dfa127fadfd11a04d72f2., tsdb,\x00\x042MmD\x88\x00\x00\x01\x00\x00S\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x029\x00\x00(\x00\x03\x03,1300314189812.87b061739a11d0f9d02acfb92ef961a2.
from sv4borg39,60020,1300313564807 2011-03-16 15:23:18,870 WARN org.apache.hadoop.hbase.master.AssignmentManager: Split report has RIT node (shouldnt have one): REGION = {NAME = 'tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.', STARTKEY = '\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07', ENDKEY = '\x00\x043L\xE7\xF50\x00\x00\x01\x00\x00I\x00\x00\x0C\x00\x00f\x00\x00\x0E\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x02u', ENCODED = 3516b74d0c9d4458c2f2f715249e3f78, TABLE = {{NAME = 'tsdb', FAMILIES = [{NAME = 't', BLOOMFILTER = 'NONE', REPLICATION_SCOPE = '0', VERSIONS = '3', COMPRESSION = 'LZO', TTL = '2147483647', BLOCKSIZE = '65536', IN_MEMORY = 'false', BLOCKCACHE = 'true'}]}} node: region=tsdb,^@^D2McZ@^@^@^A^@^@G^@^@^L^@^@f^@^@^U^@^@�^@^@(^@^C^G,1299401073466.3516b74d0c9d4458c2f2f715249e3f78., server=sv4borg39,60020,1300313564807, state=RS_ZK_REGION_OPENED # Now