[jira] [Created] (HBASE-28569) Race condition during WAL splitting leading to corrupt recovered.edits

2024-05-06 Thread Benoit Sigoure (Jira)
Benoit Sigoure created HBASE-28569:
--

 Summary: Race condition during WAL splitting leading to corrupt 
recovered.edits
 Key: HBASE-28569
 URL: https://issues.apache.org/jira/browse/HBASE-28569
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 2.4.17
Reporter: Benoit Sigoure


There is a race condition that can occur when a regionserver aborts during 
initialisation while it is splitting a WAL from another regionserver. The race 
leads to the WAL trailer for the recovered edits being written while the writer 
threads are still running, so the trailer gets interleaved with the edits, 
corrupting the recovered.edits file (and preventing the region from being 
assigned).
We've seen this happening on HBase 2.4.17, but looking at the latest code it 
seems that the race can still happen there.
The sequence of operations that leads to this issue (a simplified sketch of the 
close path follows the list):
 * {{org.apache.hadoop.hbase.wal.WALSplitter.splitWAL}} calls 
{{outputSink.close()}} after adding all the entries to the buffers
 * The output sink is {{org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink}} 
and its {{close}} method first calls {{finishWriterThreads}} in a try block, 
which in turn calls {{finish}} on every writer thread and then joins it to make 
sure it is done.
 * However, if the splitter thread gets interrupted because the RS is aborting, 
the join is interrupted and {{finishWriterThreads}} rethrows without waiting for 
the writer threads to stop.
 * This is problematic because, back in 
{{org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink.close}}, {{closeWriters}} 
is called in a finally block (so it executes even when the join was 
interrupted).
 * {{closeWriters}} will call 
{{org.apache.hadoop.hbase.wal.AbstractRecoveredEditsOutputSink.closeRecoveredEditsWriter}}
 which will call {{close}} on {{{}editWriter.writer{}}}.
 * When {{editWriter.writer}} is 
{{{}org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter{}}}, its 
{{close}} method will write the trailer before closing the file.
 * This trailer write now happens in parallel with the writer threads writing 
entries, causing corruption.
 * If there are no other errors, {{closeWriters}} will succeed in renaming all 
the temporary files to final recovered.edits files, causing problems the next 
time the region is assigned.
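
To make the race concrete, here is a simplified sketch of the close path 
described above. This is an illustration of the pattern only, not the actual 
HBase code; apart from the class and method names cited in the list, everything 
here is made up:
{code:java}
import java.io.IOException;
import java.util.List;

// Simplified sketch of the close path; the real classes are
// RecoveredEditsOutputSink and ProtobufLogWriter, but this is not their code.
class OutputSinkSketch {

  interface EditWriterSketch {
    void close() throws IOException; // writes the WAL trailer, then closes the file
  }

  private final List<Thread> writerThreads; // threads still appending edits
  private final EditWriterSketch editWriter;

  OutputSinkSketch(List<Thread> writerThreads, EditWriterSketch editWriter) {
    this.writerThreads = writerThreads;
    this.editWriter = editWriter;
  }

  void close() throws IOException {
    try {
      // finishWriterThreads(): tell each writer to finish, then join it.
      for (Thread t : writerThreads) {
        t.join(); // if the splitter thread is interrupted here, join() throws
                  // InterruptedException while the writers may still be running
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException(e); // rethrown without waiting for the writers to stop
    } finally {
      // closeWriters(): runs even after the interrupt above, so the trailer is
      // written while writer threads may still be appending edits -> corruption.
      editWriter.close();
    }
  }
}
{code}
The key point is that once the join is interrupted, nothing in this shape stops 
the finally block from writing the trailer underneath threads that are still 
appending edits.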

Log evidence supporting the above flow:
The abort is triggered (because the regionserver failed to open the WAL due to 
some ongoing infra issue):
{noformat}
regionserver-2 regionserver 06:22:00.384 
[RS_OPEN_META-regionserver/host01:16201-0] ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer - * ABORTING region 
server host01,16201,1709187641249: WAL can not clean up after init failed 
*{noformat}

We can see that the writer threads were still active after the close (even 
allowing that the ordering in the log might not be accurate, they die because 
the channel is closed while they are still writing, not because they are 
stopping):
{noformat}
regionserver-2 regionserver 06:22:09.662 [DataStreamer for file 
/hbase/data/default/aeris_v2/53308260a6b22eaf6ebb8353f7df3077/recovered.edits/03169600719-host02%2C16201%2C1709180140645.1709186722780.temp
 block BP-1645452845-192.168.2.230-1615455682886:blk_1076340939_2645368] WARN  
org.apache.hadoop.hdfs.DataStreamer - Error Recovery for 
BP-1645452845-192.168.2.230-1615455682886:blk_1076340939_2645368 in pipeline 
[DatanodeInfoWithStorage[192.168.2.230:15010,DS-2aa201ab-1027-47ec-b05f-b39d795fda85,DISK],
 
DatanodeInfoWithStorage[192.168.2.232:15010,DS-39651d5a-67d2-4126-88f0-45cdee967dab,DISK],
 Datanode
InfoWithStorage[192.168.2.231:15010,DS-e08a1d17-f7b1-4e39-9713-9706bd762f48,DISK]]:
 datanode 
2(DatanodeInfoWithStorage[192.168.2.231:15010,DS-e08a1d17-f7b1-4e39-9713-9706bd762f48,DISK])
 is bad.
regionserver-2 regionserver 06:22:09.742 [split-log-closeStream-pool-1] INFO  
org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink - Closed recovered edits 
writer 
path=hdfs://mycluster/hbase/data/default/aeris_v2/53308260a6b22eaf6ebb8353f7df3077/recovered.edits/03169600719-host02%2C16201%
2C1709180140645.1709186722780.temp (wrote 5949 edits, skipped 0 edits in 93 ms)
regionserver-2 regionserver 06:22:09.743 
[RS_LOG_REPLAY_OPS-regionserver/host01:16201-1-Writer-0] ERROR 
org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink - Failed to write log 
entry aeris_v2/53308260a6b22eaf6ebb8353f7df3077/3169611655=[#edits: 8 = 
] to log
regionserver-2 regionserver java.nio.channels.ClosedChannelException: null
regionserver-2 regionserver    at 
org.apache.hadoop.hdfs.ExceptionLastSeen.throwException4Close(ExceptionLastSeen.java:73)
 ~[hadoop-hdfs-client-3.2.4.jar:?]
regionserver-2 regionserver    at 
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:153) 
~[hadoop-hdfs-client-3.2.4.jar:?]
regionserver-2 regionserver    at 

[jira] [Updated] (HBASE-28569) Race condition during WAL splitting leading to corrupt recovered.edits

2024-05-06 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-28569:
---
Description: 
There is a race condition that can occur when a regionserver aborts during 
initialisation while it is splitting a WAL from another regionserver. The race 
leads to the WAL trailer for the recovered edits being written while the writer 
threads are still running, so the trailer gets interleaved with the edits, 
corrupting the recovered.edits file (and preventing the region from being 
assigned).
We've seen this happening on HBase 2.4.17, but looking at the latest code it 
seems that the race can still happen there.
The sequence of operations that leads to this issue (see the note on 
{{Thread.join()}} semantics after this list):
 * {{org.apache.hadoop.hbase.wal.WALSplitter.splitWAL}} calls 
{{outputSink.close()}} after adding all the entries to the buffers
 * The output sink is {{org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink}} 
and its {{close}} method first calls {{finishWriterThreads}} in a try block, 
which in turn calls {{finish}} on every writer thread and then joins it to make 
sure it is done.
 * However, if the splitter thread gets interrupted because the RS is aborting, 
the join is interrupted and {{finishWriterThreads}} rethrows without waiting for 
the writer threads to stop.
 * This is problematic because, back in 
{{org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink.close}}, {{closeWriters}} 
is called in a finally block (so it executes even when the join was 
interrupted).
 * {{closeWriters}} will call 
{{org.apache.hadoop.hbase.wal.AbstractRecoveredEditsOutputSink.closeRecoveredEditsWriter}}
 which will call {{close}} on {{{}editWriter.writer{}}}.
 * When {{editWriter.writer}} is 
{{{}org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter{}}}, its 
{{close}} method will write the trailer before closing the file.
 * This trailer write now happens in parallel with the writer threads writing 
entries, causing corruption.
 * If there are no other errors, {{closeWriters}} will succeed in renaming all 
the temporary files to final recovered.edits files, causing problems the next 
time the region is assigned.
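
One detail worth spelling out (standard Java semantics, nothing HBase-specific): 
when {{Thread.join()}} throws {{InterruptedException}}, it gives no guarantee 
that the joined thread has finished, so after the interrupt the writer threads 
can legitimately still be running. A minimal, self-contained demonstration:
{code:java}
public class JoinInterruptDemo {
  public static void main(String[] args) throws Exception {
    Thread writer = new Thread(() -> {
      try {
        Thread.sleep(5_000); // stands in for a writer thread still appending edits
      } catch (InterruptedException ignored) {
      }
    });
    writer.start();

    Thread joiner = new Thread(() -> {
      try {
        writer.join(); // waits for the writer...
      } catch (InterruptedException e) {
        // ...but an interrupt aborts the wait without stopping the writer.
        System.out.println("join interrupted; writer alive = " + writer.isAlive());
      }
    });
    joiner.start();

    Thread.sleep(200);
    joiner.interrupt(); // prints "join interrupted; writer alive = true"
    joiner.join();
    writer.interrupt(); // clean up
    writer.join();
  }
}
{code}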

Log evidence supporting the above flow:
The abort is triggered (because the regionserver failed to open the WAL due to 
some ongoing infra issue):
{noformat}
regionserver-2 regionserver 06:22:00.384 
[RS_OPEN_META-regionserver/host01:16201-0] ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer - * ABORTING region 
server host01,16201,1709187641249: WAL can not clean up after init failed 
*{noformat}
We can see that the writer threads were still active after the close (even 
allowing that the ordering in the log might not be accurate, they die because 
the channel is closed while they are still writing, not because they are 
stopping):
{noformat}
regionserver-2 regionserver 06:22:09.662 [DataStreamer for file 
/hbase/data/default/aeris_v2/53308260a6b22eaf6ebb8353f7df3077/recovered.edits/03169600719-host02%2C16201%2C1709180140645.1709186722780.temp
 block BP-1645452845-192.168.2.230-1615455682886:blk_1076340939_2645368] WARN  
org.apache.hadoop.hdfs.DataStreamer - Error Recovery for 
BP-1645452845-192.168.2.230-1615455682886:blk_1076340939_2645368 in pipeline 
[DatanodeInfoWithStorage[192.168.2.230:15010,DS-2aa201ab-1027-47ec-b05f-b39d795fda85,DISK],
 
DatanodeInfoWithStorage[192.168.2.232:15010,DS-39651d5a-67d2-4126-88f0-45cdee967dab,DISK],
 Datanode
InfoWithStorage[192.168.2.231:15010,DS-e08a1d17-f7b1-4e39-9713-9706bd762f48,DISK]]:
 datanode 
2(DatanodeInfoWithStorage[192.168.2.231:15010,DS-e08a1d17-f7b1-4e39-9713-9706bd762f48,DISK])
 is bad.
regionserver-2 regionserver 06:22:09.742 [split-log-closeStream-pool-1] INFO  
org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink - Closed recovered edits 
writer 
path=hdfs://mycluster/hbase/data/default/aeris_v2/53308260a6b22eaf6ebb8353f7df3077/recovered.edits/03169600719-host02%2C16201%
2C1709180140645.1709186722780.temp (wrote 5949 edits, skipped 0 edits in 93 ms)
regionserver-2 regionserver 06:22:09.743 
[RS_LOG_REPLAY_OPS-regionserver/host01:16201-1-Writer-0] ERROR 
org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink - Failed to write log 
entry aeris_v2/53308260a6b22eaf6ebb8353f7df3077/3169611655=[#edits: 8 = 
] to log
regionserver-2 regionserver java.nio.channels.ClosedChannelException: null
regionserver-2 regionserver    at 
org.apache.hadoop.hdfs.ExceptionLastSeen.throwException4Close(ExceptionLastSeen.java:73)
 ~[hadoop-hdfs-client-3.2.4.jar:?]
regionserver-2 regionserver    at 
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:153) 
~[hadoop-hdfs-client-3.2.4.jar:?]
regionserver-2 regionserver    at 
org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105) 
~[hadoop-common-3.2.4.jar:?]
regionserver-2 regionserver    at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
 ~[hadoop-common-3.2.4.jar:?]
regionserver-2 

[jira] [Commented] (HBASE-27696) [hbase-operator-tools] Use $revision as placeholder for maven version

2023-03-27 Thread Benoit Sigoure (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705733#comment-17705733
 ] 

Benoit Sigoure commented on HBASE-27696:


Hi guys, any plans to cut a release any time soon?

> [hbase-operator-tools] Use $revision as placeholder for maven version
> -
>
> Key: HBASE-27696
> URL: https://issues.apache.org/jira/browse/HBASE-27696
> Project: HBase
>  Issue Type: Task
>  Components: build, pom
>Affects Versions: hbase-operator-tools-1.3.0
>Reporter: Nick Dimiduk
>Assignee: Nick Dimiduk
>Priority: Major
> Fix For: hbase-operator-tools-1.3.0
>
>
> To align with our main repo.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27357) Online schema migration causes the master to report a very high or negative number of QPS

2022-09-05 Thread Benoit Sigoure (Jira)
Benoit Sigoure created HBASE-27357:
--

 Summary: Online schema migration causes the master to report a 
very high or negative number of QPS
 Key: HBASE-27357
 URL: https://issues.apache.org/jira/browse/HBASE-27357
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.4.12
 Environment: |JVM Version|Oracle Corporation 11.0.15-11.0.15+10|
Reporter: Benoit Sigoure
 Attachments: Screen Shot 2022-09-05 at 18.31.00.png, Screen Shot 
2022-09-05 at 18.31.06.png

We've seen this a few times when making an online schema change, e.g.:
{code:java}
alter 'foo', {NAME=>'e',VERSIONS=>'2147483646'} {code}
This causes the master to briefly show extremely high QPS per regionserver, and 
probably causes an integer overflow in the sum.
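
For illustration only (this is an assumption about the cause, not taken from the 
actual metrics code): the symptom is consistent with summing per-regionserver 
request counts into a 32-bit {{int}}; once the sum exceeds 
{{Integer.MAX_VALUE}}, it wraps around to a negative value, which would render 
as a huge or negative QPS. A minimal sketch with hypothetical counts:
{code:java}
public class QpsOverflowSketch {
  public static void main(String[] args) {
    // Hypothetical per-regionserver request counts seen during the migration.
    int[] perServerRequests = { 2_000_000_000, 500_000_000, 300_000_000 };

    int intSum = 0;    // 32-bit accumulator: overflows silently
    long longSum = 0L; // 64-bit accumulator: correct
    for (int requests : perServerRequests) {
      intSum += requests;
      longSum += requests;
    }

    // intSum wraps past Integer.MAX_VALUE (2147483647) and becomes negative,
    // which would show up as a huge or negative QPS number in the UI.
    System.out.println("int sum  = " + intSum);   // prints a negative number
    System.out.println("long sum = " + longSum);  // prints 2800000000
  }
}
{code}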

!Screen Shot 2022-09-05 at 18.31.00.png|width=859,height=323!

...

!Screen Shot 2022-09-05 at 18.31.06.png|width=856,height=322!

This could be related to the issue reported in HBASE-27242.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HBASE-27242) HBase master reports a region has been in transition for 50+ years

2022-07-26 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-27242:
---
Description: 
Every time we upgrade our HBase clusters we get some spurious alerts firing 
because for a brief period of time the HBase master reports some impossibly 
high RIT (region in transition) time.

For example:

!image.png|width=747,height=249!

 

The condition resolves itself on its own within a few minutes. I'll try to find 
some relevant logs and attach them to this issue.

  was:
Every time we upgrade our HBase clusters we get some spurious alerts firing 
because for a brief period of time the HBase master reports some impossibly 
high RIT (region in transition) time.

For example:

!image.png!

 

The condition resolves itself on its own within a few minutes. I'll try to find 
some relevant logs and attach them to this issue.


> HBase master reports a region has been in transition for 50+ years
> --
>
> Key: HBASE-27242
> URL: https://issues.apache.org/jira/browse/HBASE-27242
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.4.8, 2.4.9, 2.4.10, 2.4.11
> Environment: openjdk version "11.0.14.1" 2022-02-08
> OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
>Reporter: Benoit Sigoure
>Priority: Major
> Attachments: image.png
>
>
> Every time we upgrade our HBase clusters we get some spurious alerts firing 
> because for a brief period of time the HBase master reports some impossibly 
> high RIT (region in transition) time.
> For example:
> !image.png|width=747,height=249!
>  
> The condition resolves itself on its own within a few minutes. I'll try to 
> find some relevant logs and attach them to this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27242) HBase master reports a region has been in transition for 50+ years

2022-07-26 Thread Benoit Sigoure (Jira)
Benoit Sigoure created HBASE-27242:
--

 Summary: HBase master reports a region has been in transition for 
50+ years
 Key: HBASE-27242
 URL: https://issues.apache.org/jira/browse/HBASE-27242
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.4.11, 2.4.10, 2.4.9, 2.4.8
 Environment: openjdk version "11.0.14.1" 2022-02-08

OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)

OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
Reporter: Benoit Sigoure
 Attachments: image.png

Every time we upgrade our HBase clusters we get some spurious alerts firing 
because for a brief period of time the HBase master reports some impossibly 
high RIT (region in transition) time.

For example:

!image.png!

 

The condition resolves itself on its own within a few minutes. I'll try to find 
some relevant logs and attach them to this issue.
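
A guess at why the number is specifically "50+ years" (an assumption on my part, 
not confirmed from the code): if the RIT start timestamp is momentarily read as 
0 (the Unix epoch) while the age is computed against the current wall clock, the 
reported duration comes out at roughly 52 years in 2022. A small illustration of 
that arithmetic:
{code:java}
import java.time.Duration;
import java.time.Instant;

public class RitAgeSketch {
  public static void main(String[] args) {
    // Assumption for illustration: the region-in-transition start time is
    // momentarily seen as 0 (the Unix epoch) during the master restart.
    long ritStartMillis = 0L;
    long nowMillis = Instant.parse("2022-07-26T00:00:00Z").toEpochMilli();

    long ageYears = Duration.ofMillis(nowMillis - ritStartMillis).toDays() / 365;
    System.out.println("Reported RIT age: ~" + ageYears + " years"); // ~52 years
  }
}
{code}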



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed'

2022-03-11 Thread Benoit Sigoure (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505128#comment-17505128
 ] 

Benoit Sigoure commented on HBASE-26042:


For some reason Mike can't upload files (maybe new accounts aren't immediately 
allowed to upload attachments?). Anyway, I just posted a heap dump along with 
the thread dump that was taken at around the same time.

> WAL lockup on 'sync failed'
> ---
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5, 2.4.8
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2, 
> regionserver-heap-live.hprof.gz, regionserver-threaddump.log
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - 

[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed'

2022-03-11 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-26042:
---
Attachment: regionserver-heap-live.hprof.gz

> WAL lockup on 'sync failed'
> ---
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5, 2.4.8
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2, 
> regionserver-heap-live.hprof.gz, regionserver-threaddump.log
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next() @bci=2, line=105 
> (Interpreted frame)
>  - 

[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed'

2022-03-11 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-26042:
---
Attachment: regionserver-threaddump.log

> WAL lockup on 'sync failed'
> ---
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5, 2.4.8
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2, 
> regionserver-threaddump.log
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next() @bci=2, line=105 
> (Interpreted frame)
>  - com.lmax.disruptor.RingBuffer.next() @bci=4, line=263 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed'

2022-03-11 Thread Benoit Sigoure (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505123#comment-17505123
 ] 

Benoit Sigoure commented on HBASE-26042:


Hi Andrew, thanks for your reply. I already attached the regionserver logs as 
well as the stack trace {{/dump}} from the servlet. Mike is going to post a 
heap dump soon.

We've been seeing quite a few instances of this bug lately; I think a number of 
the "HBase is stuck" kinda reports I've heard about over the past year or so 
were likely due to this bug. We are able to reproduce it relatively easily by 
taking a cluster and killing nodes randomly.

> WAL lockup on 'sync failed'
> ---
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5, 2.4.8
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - 

[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed'

2022-03-08 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-26042:
---
Summary: WAL lockup on 'sync failed'  (was: WAL lockup on 'sync failed' 
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
readAddress(..) failed: Connection reset by peer)

> WAL lockup on 'sync failed'
> ---
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5, 2.4.8
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 
> (Compiled frame)
>  - 

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2022-03-08 Thread Benoit Sigoure (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502810#comment-17502810
 ] 

Benoit Sigoure commented on HBASE-26042:


We've run into this issue on a test cluster with HBase 2.4.8.

Let me know if I can collect anything else to help you, as things are still 
stuck right now and we can probably keep it untouched for another day or two as 
it's a test cluster.

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5, 2.4.8
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: 

[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2022-03-08 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-26042:
---
Affects Version/s: 2.4.8

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5, 2.4.8
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - 

[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2022-03-08 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-26042:
---
Attachment: debug-dump.txt

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. 
> unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - 

[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

2022-03-08 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-26042:
---
Attachment: hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log

> WAL lockup on 'sync failed' 
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: 
> readAddress(..) failed: Connection reset by peer
> 
>
> Key: HBASE-26042
> URL: https://issues.apache.org/jira/browse/HBASE-26042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.3.5
>Reporter: Michael Stack
>Priority: Major
> Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, 
> hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2
>
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 
> seconds, a few STUCK MVCCs from which it recovered and some java pauses up to 
> three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://:8020/hbase] 
> wal.AsyncFSWAL: sync 
> failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
>  readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints 
> text versions of requests complaining requestsTooSlow. Then we start to see 
> these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync 
> result after 30 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will 
> roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test 
> -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
>  {code}
> I tried hacking the test to repro the above hang by throwing same exception 
> in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is 
> stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, 
> line=1707 (Compiled frame)
>  - 
> java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
>  @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, 
> line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled 
> frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer)
>  @bci=16, line=189 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[],
>  org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) 
> @bci=9, line=202 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, 
> long) @bci=107, line=170 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration,
>  org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, 
> org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) 
> @bci=61, line=113 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=22, line=651 (Compiled frame)
>  - 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path)
>  @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) 
> @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) 
> @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 
> (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL with flush markers etc., 
> unable to add to the ring buffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 
> (Compiled frame)
>  - 
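
A minimal, self-contained illustration (not HBase code) of the failure mode visible in the roller dump above, where the thread parks in {{CompletableFuture.get()}} with no timeout; a bounded wait, as in the hypothetical helper below, would surface the stall as an error instead of hanging the roll forever:

{code:java}
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWriteWait {
  /** Wait for an asynchronous write with an upper bound instead of blocking forever. */
  static <T> T waitForWrite(CompletableFuture<T> writeFuture, long timeoutMs)
      throws IOException {
    try {
      return writeFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // With a plain get() this path does not exist and the caller parks forever.
      throw new IOException("write did not complete within " + timeoutMs + " ms", e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("interrupted while waiting for write", e);
    } catch (ExecutionException e) {
      throw new IOException("write failed", e.getCause());
    }
  }
}
{code}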

[jira] [Commented] (HBASE-21476) Support for nanosecond timestamps

2022-01-03 Thread Benoit Sigoure (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468258#comment-17468258
 ] 

Benoit Sigoure commented on HBASE-21476:


Uploaded a patch for use against HBase version 2.4.9

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> HBASE-21476.branch-2.1.0003.patch, HBASE-21476.branch-2.1.0004.patch, 
> HBASE-21476.branch-2.4.001.patch, nanosecond_timestamps_v1.patch, 
> nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.
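
A hedged sketch of the {{currentTimeNano()}} idea described above, using {{java.time.Instant}}; the class and method names here are illustrative only and may not match the attached patches:

{code:java}
import java.time.Instant;

public final class NanoTimeSketch {
  /** Nanoseconds since the epoch, derived from java.time.Instant (requires Java 8+). */
  public static long currentTimeNano() {
    final Instant now = Instant.now();
    return now.getEpochSecond() * 1_000_000_000L + now.getNano();
  }

  public static void main(String[] args) {
    System.out.println(currentTimeNano());
  }
}
{code}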



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-21476) Support for nanosecond timestamps

2022-01-03 Thread Benoit Sigoure (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-21476:
---
Attachment: HBASE-21476.branch-2.4.001.patch

> Support for nanosecond timestamps
> -
>
> Key: HBASE-21476
> URL: https://issues.apache.org/jira/browse/HBASE-21476
> Project: HBase
>  Issue Type: New Feature
>Affects Versions: 2.1.1
>Reporter: Andrey Elenskiy
>Assignee: Andrey Elenskiy
>Priority: Major
>  Labels: features, patch
> Attachments: Apache HBase - Nanosecond Timestamps v1.pdf, 
> HBASE-21476.branch-2.1.0003.patch, HBASE-21476.branch-2.1.0004.patch, 
> HBASE-21476.branch-2.4.001.patch, nanosecond_timestamps_v1.patch, 
> nanosecond_timestamps_v2.patch
>
>
> Introducing a new table attribute "NANOSECOND_TIMESTAMPS" to tell HBase to 
> handle timestamps with nanosecond precision. This is useful for applications 
> that timestamp updates at the source with nanoseconds and still want features 
> like column family TTL and "hbase.hstore.time.to.purge.deletes" to work.
> The attribute should be specified either on new tables or on existing tables 
> which have timestamps only with nanosecond precision. There's no migration 
> from milliseconds to nanoseconds for already existing tables. We could add 
> this migration as part of compaction if you think that would be useful, but 
> that would obviously make the change more complex.
> I've added a new EnvironmentEdge method "currentTimeNano()" that uses 
> [java.time.Instant|https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html]
>  to get time in nanoseconds which means it will only work with Java 8. The 
> idea is to gradually replace all places where "EnvironmentEdge.currentTime()" 
> is used to have HBase working purely with nanoseconds (which is a 
> prerequisite for HBASE-14070). Also, I've refactored ScanInfo and 
> PartitionedMobCompactor to expect TableDescriptor as an argument which makes 
> code a little cleaner and easier to extend.
> Couple more points:
> - column family TTL (specified in seconds) and 
> "hbase.hstore.time.to.purge.deletes" (specified in milliseconds) options 
> don't need to be changed, those are adjusted automatically.
> - Per cell TTL needs to be scaled by clients accordingly after 
> "NANOSECOND_TIMESTAMPS" table attribute is specified.
> Looking for everyone's feedback to know if that's a worthwhile direction. 
> Will add more comprehensive tests in a later patch.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26383) HBCK incorrectly reports inconsistencies for recently split regions following a master failover

2021-10-20 Thread Benoit Sigoure (Jira)
Benoit Sigoure created HBASE-26383:
--

 Summary: HBCK incorrectly reports inconsistencies for recently 
split regions following a master failover
 Key: HBASE-26383
 URL: https://issues.apache.org/jira/browse/HBASE-26383
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.4.3
Reporter: Benoit Sigoure


When a region P splits into A and B, following a master failover the newly 
active master reports that P is in an inconsistent state. This seems to be a 
regression introduced in HBASE-25847 (cc [~andrew.purt...@gmail.com]) which 
changed {{regionInfo.isParentSplit()}} to {{regionState.isSplit()}}. The region 
state after restart is CLOSED (rather than SPLIT), so both region state and 
region info should be checked, presumably with {{regionState.isSplit() || 
regionInfo.isSplit()}}. This situation resolves itself on its own when a major 
compaction occurs and P is GCed, but having the master incorrectly report 
inconsistencies is pretty bad. We had a pretty big outage due to a series of 
operator errors as our SRE team was trying to fix this inconsistency that, in 
fact, didn't even exist.
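
A minimal sketch of the suggested check; the surrounding logic and the {{reportInconsistency}} helper are hypothetical, only the two accessors come from the description above:

{code:java}
// Treat P as split if either the in-memory region state or the region info says so.
boolean parentIsSplit = regionState.isSplit() || regionInfo.isSplit();
if (!parentIsSplit) {
  reportInconsistency(regionInfo);  // hypothetical: only flag P when neither view says it split
}
{code}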

Thanks to Stack for helping look over this issue and Vlad Hanciuta for root 
causing the bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"

2018-05-16 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478183#comment-16478183
 ] 

Benoit Sigoure commented on HBASE-20463:


You're right, my bad. Once I downgraded to JDK8 things started working again. 
Thanks!

> Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL 
> change and document"
> --
>
> Key: HBASE-20463
> URL: https://issues.apache.org/jira/browse/HBASE-20463
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Reporter: stack
>Assignee: Sean Busbey
>Priority: Blocker
> Fix For: 1.5.0, 1.4.4
>
> Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch
>
>
> Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] 
> (and [~apurtell]).
> See parent for discussion on breakage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"

2018-05-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476499#comment-16476499
 ] 

Benoit Sigoure edited comment on HBASE-20463 at 5/15/18 9:32 PM:
-

I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code}

{code}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}


was (Author: tsuna):
I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code:java}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code:java}

{code:java}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}

> Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL 
> change and document"
> --
>
> Key: HBASE-20463
> URL: https://issues.apache.org/jira/browse/HBASE-20463
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Reporter: stack
>Assignee: Sean Busbey
>Priority: Blocker
> Fix For: 1.5.0, 1.4.4
>
> Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch
>
>
> Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] 
> (and [~apurtell]).
> See parent for discussion on breakage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"

2018-05-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476499#comment-16476499
 ] 

Benoit Sigoure edited comment on HBASE-20463 at 5/15/18 9:32 PM:
-

I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])
{code:java}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code:java}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}


was (Author: tsuna):
I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code}

{code}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}

> Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL 
> change and document"
> --
>
> Key: HBASE-20463
> URL: https://issues.apache.org/jira/browse/HBASE-20463
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Reporter: stack
>Assignee: Sean Busbey
>Priority: Blocker
> Fix For: 1.5.0, 1.4.4
>
> Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch
>
>
> Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] 
> (and [~apurtell]).
> See parent for discussion on breakage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"

2018-05-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476499#comment-16476499
 ] 

Benoit Sigoure edited comment on HBASE-20463 at 5/15/18 9:31 PM:
-

I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code:java}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code}

{code:java}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}


was (Author: tsuna):
I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code:java}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code}
{code:java}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}

> Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL 
> change and document"
> --
>
> Key: HBASE-20463
> URL: https://issues.apache.org/jira/browse/HBASE-20463
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Reporter: stack
>Assignee: Sean Busbey
>Priority: Blocker
> Fix For: 1.5.0, 1.4.4
>
> Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch
>
>
> Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] 
> (and [~apurtell]).
> See parent for discussion on breakage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"

2018-05-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476499#comment-16476499
 ] 

Benoit Sigoure edited comment on HBASE-20463 at 5/15/18 9:31 PM:
-

I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code:java}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code:java}

{code:java}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}


was (Author: tsuna):
I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code:java}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code}

{code:java}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}

> Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL 
> change and document"
> --
>
> Key: HBASE-20463
> URL: https://issues.apache.org/jira/browse/HBASE-20463
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Reporter: stack
>Assignee: Sean Busbey
>Priority: Blocker
> Fix For: 1.5.0, 1.4.4
>
> Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch
>
>
> Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] 
> (and [~apurtell]).
> See parent for discussion on breakage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20463) Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL change and document"

2018-05-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476499#comment-16476499
 ] 

Benoit Sigoure commented on HBASE-20463:


I'm still seeing the error with HBase 1.4.4 (I'm using the binary release 
[here|http://www-us.apache.org/dist/hbase/1.4.4/hbase-1.4.4-bin.tar.gz])

{code:java}
foo@cc2a495eedfe:~$ java -version
openjdk version "9.0.4"
OpenJDK Runtime Environment (build 9.0.4+12-Debian-4)
OpenJDK 64-Bit Server VM (build 9.0.4+12-Debian-4, mixed mode){code}
{code}
{code:java}
foo@cc2a495eedfe:~$ /home/foo/hbase/bin/hbase shell
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.java.invokers.RubyToJavaInvoker 
(file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar) to method 
java.lang.Object.registerNatives()
WARNING: Please consider reporting this to the maintainers of 
org.jruby.java.invokers.RubyToJavaInvoker
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
ArgumentError: wrong number of arguments (0 for 1)
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:10
method_added at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/javasupport/core_ext/object.rb:129
Pattern at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:2
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:1
require at org/jruby/RubyKernel.java:1062
(root) at 
file:/home/foo/hbase/lib/jruby-complete-1.6.8.jar!/builtin/java/java.util.regex.rb:42
(root) at /home/foo/hbase/bin/../bin/hirb.rb:38
{code}

> Fix breakage introduced on branch-1 by HBASE-20276 "[shell] Revert shell REPL 
> change and document"
> --
>
> Key: HBASE-20463
> URL: https://issues.apache.org/jira/browse/HBASE-20463
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Reporter: stack
>Assignee: Sean Busbey
>Priority: Blocker
> Fix For: 1.5.0, 1.4.4
>
> Attachments: HBASE-20463-branch-1.v0.patch, HBASE-20463.0.patch
>
>
> Hope you don't mind my making an issue for fixing branch-1 breakage [~busbey] 
> (and [~apurtell]).
> See parent for discussion on breakage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-18372) Potential infinite busy loop in HMaster's ProcedureExecutor

2017-07-12 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-18372:
--

 Summary: Potential infinite busy loop in HMaster's 
ProcedureExecutor
 Key: HBASE-18372
 URL: https://issues.apache.org/jira/browse/HBASE-18372
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 1.3.1
 Environment: Kernel 3.10.0-327.10.1.el7.x86_64
JVM 1.8.0_102
Reporter: Benoit Sigoure


While investigating an issue today with [~timoha] we saw the HMaster 
consistently burning 1.5 cores of CPU cycles.  Upon looking more closely, it 
was actually all 8 threads of the {{ProcedureExecutor}} thread pool each 
constantly taking ~15% of a CPU core (I identified this by looking at individual 
threads in {{top}} and cross-referencing the thread IDs with the thread IDs in 
a JVM stack trace).  The HMaster log or output didn't contain anything 
suspicious and it was hard for us to ascertain what exactly was happening.  It 
just looked like these threads were regularly spinning, doing nothing.  We just 
saw a lot of {{futex}} system calls happening all the time, and all the threads 
of the thread pool regularly taking turns in waking up and going back to sleep.

My reading of the code in {{procedure2/ProcedureExecutor.java}} is that this 
can happen if the threads in the thread pool have been interrupted for some 
reason:

{code}
  private void execLoop() {
while (isRunning()) {
  Procedure proc = runnables.poll();
  if (proc == null) continue;
{code}
and then in {{master/procedure/MasterProcedureScheduler.java}}:
{code}
  @Override
  public Procedure poll() {
return poll(-1);
  }

  @edu.umd.cs.findbugs.annotations.SuppressWarnings("WA_AWAIT_NOT_IN_LOOP")
  Procedure poll(long waitNsec) {
Procedure pollResult = null;
schedLock.lock();
try {
  if (queueSize == 0) {
if (waitNsec < 0) {
  schedWaitCond.await();
[...]
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
} finally {
  schedLock.unlock();
}
return pollResult;
  }
{code}

so my theory is that the threads in the thread pool have all been interrupted 
(maybe by a procedure that ran earlier and left its thread interrupted) and so we 
are perpetually looping in {{execLoop}}, which ends up calling 
{{schedWaitCond.await();}}, which ends up throwing an {{InterruptedException}} 
immediately, whose handler ends up restoring the interrupt status of the thread, 
and rinse and repeat (see the demo below).
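
A self-contained demo (not HBase code) of that mechanism: once a thread's interrupt flag is set, {{Condition.await()}} throws {{InterruptedException}} immediately, and re-interrupting in the catch block keeps it throwing on every subsequent poll, turning the outer loop into a busy spin:

{code:java}
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class InterruptBusyLoopDemo {
  public static void main(String[] args) throws InterruptedException {
    final ReentrantLock lock = new ReentrantLock();
    final Condition cond = lock.newCondition();
    Thread worker = new Thread(() -> {
      int spins = 0;
      while (spins < 5) {            // bounded here; the real loop is while (isRunning())
        lock.lock();
        try {
          cond.await();              // throws right away while the interrupt flag is set
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();  // restores the flag, so the next await() throws again
          spins++;
          System.out.println("spin " + spins + ": interrupt flag still set");
        } finally {
          lock.unlock();
        }
      }
    });
    worker.start();
    worker.interrupt();              // a single interrupt is enough to keep the loop spinning
    worker.join();
  }
}
{code}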

But again I wasn't able to get any cold hard evidence that this is what was 
happening.  There was just no other evidence that could explain this behavior, 
and I wasn't able to guess what else could be causing this that was consistent 
with what we saw and what I understood from reading the code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18042) Client Compatibility breaks between versions 1.2 and 1.3

2017-05-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011327#comment-16011327
 ] 

Benoit Sigoure commented on HBASE-18042:


We can easily update AsyncHBase to accommodate the change; however, I would 
like to voice disagreement with this statement:
{quote}
It is an unfortunate thing that we have broken the semantics, but in general 
this is "allowed".
{quote}
Such semantic changes are like breaking API changes, they are, well, breaking 
changes.  Not cool.

One of the challenges with AsyncHBase is that it has to work with all versions 
of HBase.  Since {{more_results_in_region}} was already there in 1.2 but needs 
to be handled differently in 1.3, that makes it kinda hard for AsyncHBase to 
know how, exactly, to deal with this flag being set, right?

> Client Compatibility breaks between versions 1.2 and 1.3
> 
>
> Key: HBASE-18042
> URL: https://issues.apache.org/jira/browse/HBASE-18042
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Karan Mehta
>Assignee: Karan Mehta
>
> OpenTSDB uses AsyncHBase as its client, rather than using the traditional 
> HBase Client. From version 1.2 to 1.3, the {{ClientProtos}} have been 
> changed. Newer fields are added to {{ScanResponse}} proto.
> A typical Scan request in 1.2 would require the caller to make an 
> OpenScanner request, GetNextRows requests and a CloseScanner request, based on 
> the {{more_rows}} boolean field in the {{ScanResponse}} proto.
> However, in 1.3 a new parameter {{more_results_in_region}} was added, which 
> limits the results per region. Therefore the client now has to manage sending 
> all the requests for each region. Furthermore, if the results are exhausted 
> from a particular region, the {{ScanResponse}} will set 
> {{more_results_in_region}} to false, but {{more_results}} can still be true. 
> Whenever the former is set to false, the {{RegionScanner}} will also be 
> closed. 
> OpenTSDB makes an OpenScanner request and receives all its results in the 
> first {{ScanResponse}} itself, thus creating the condition described in the 
> paragraph above. Since {{more_rows}} is true, it will proceed to send the next 
> request, at which point {{RSRpcServices}} will throw 
> {{UnknownScannerException}}. The protobuf client compatibility is maintained 
> but the expected behavior is modified.
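
A hedged sketch of how a client might combine the two flags described above; the {{ScanResponse}} accessors follow the generated protobuf naming, while the three handler methods are hypothetical placeholders, not AsyncHBase code:

{code:java}
final boolean moreResults = resp.getMoreResults();
// Pre-1.3 servers never set this field, so treat "absent" as "region not exhausted".
final boolean moreInRegion = !resp.hasMoreResultsInRegion() || resp.getMoreResultsInRegion();

if (!moreResults) {
  closeScanner();             // hypothetical: the whole scan is done
} else if (!moreInRegion) {
  // 1.3 semantics: this RegionScanner is exhausted (and closed server-side);
  // open a scanner on the next region instead of sending another next request.
  openScannerOnNextRegion();  // hypothetical
} else {
  sendNextRequest();          // hypothetical: keep using the same scanner id
}
{code}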



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-17489) ClientScanner may send a next request to a RegionScanner which has been exhausted

2017-05-08 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001480#comment-16001480
 ] 

Benoit Sigoure commented on HBASE-17489:


AsyncHBase expects the scanner ID in response to scanning more rows but that's 
not actually necessary.  I think I added this as a sanity check because I 
expected the server to always return the ID, but as was said above it's 
technically not strictly necessary for the server to return the ID on 
subsequent uses of the scanner.

The code doesn't even do anything with the scanner ID other than checking that 
it's the ID we expected:
{code}
@Override
Response deserialize(final ChannelBuffer buf, final int cell_size) {
  final ScanResponse resp = readProtobuf(buf, ScanResponse.PARSER);
  final long id = resp.getScannerId();
  if (scanner_id != id) {
    throw new InvalidResponseException("Scan RPC response was for scanner"
        + " ID " + id + " but we expected "
        + scanner_id, resp);
  }
  final ArrayList rows = getRows(resp, buf, cell_size);
  if (rows == null) {
    return null;
  }
  return new Response(resp.getScannerId(), rows, resp.getMoreResults());
}
{code}

I guess we could fix this by saying "if we have a scanner ID in the response 
THEN it must match the one we expect" instead of "there must be a scanner ID in 
the response that matches what we expect".
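
A hedged sketch of that relaxed check, adapted from the snippet above; {{hasScannerId()}} is the usual generated accessor for an optional protobuf field, and the rest mirrors the existing code:

{code:java}
// Only validate the scanner ID when the server actually sent one back.
if (resp.hasScannerId() && resp.getScannerId() != scanner_id) {
  throw new InvalidResponseException("Scan RPC response was for scanner"
      + " ID " + resp.getScannerId() + " but we expected " + scanner_id, resp);
}
{code}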

Ironically we had the same bug in GoHBase, where we made the same assumption 
that the scanner ID was always present in the response.

> ClientScanner may send a next request to a RegionScanner which has been 
> exhausted
> -
>
> Key: HBASE-17489
> URL: https://issues.apache.org/jira/browse/HBASE-17489
> Project: HBase
>  Issue Type: Bug
>  Components: Client, scan
>Affects Versions: 2.0.0, 1.3.0, 1.4.0
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 2.0.0, 1.4.0, 1.3.1
>
> Attachments: HBASE-17489-branch-1.3.patch, 
> HBASE-17489-branch-1.patch, HBASE-17489.patch, HBASE-17489-v1.patch, 
> HBASE-17489-v2.patch, HBASE-17489-v3.patch, HBASE-17489-v4.patch, 
> HBASE-17489-v4.patch, HBASE-17489-v5.patch, HBASE-17489-v6.patch
>
>
> Found it when implementing HBASE-17045. Seems the final result of the scan is 
> correct but no doubt the logic is broken. We need to fix it to stop things 
> get worse in the future.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-13329) ArrayIndexOutOfBoundsException in CellComparator#getMinimumMidpointArray

2015-07-14 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627056#comment-14627056
 ] 

Benoit Sigoure commented on HBASE-13329:


I'm kinda late to the party but yeah OpenTSDB compactions might cause long 
column qualifiers.  OpenTSDB doesn't generally use long row keys though, so 
that makes total sense.  Thanks for getting to the bottom of this one!

 ArrayIndexOutOfBoundsException in CellComparator#getMinimumMidpointArray
 

 Key: HBASE-13329
 URL: https://issues.apache.org/jira/browse/HBASE-13329
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 1.0.1
 Environment: linux-debian-jessie
 ec2 - t2.micro instances
Reporter: Ruben Aguiar
Assignee: Lars Hofhansl
Priority: Critical
 Fix For: 2.0.0, 1.0.2, 1.2.0, 1.1.2

 Attachments: 13329-asserts.patch, 13329-v1.patch, 13329.txt, 
 HBASE-13329.test.00.branch-1.1.patch


 While trying to benchmark my OpenTSDB cluster, I created a script that 
 always sends the same value to HBase (in this case 1). After a few minutes, 
 the whole regionserver crashes and the region itself becomes impossible to 
 open again (cannot assign or unassign). After some investigation, what I saw 
 in the logs is that when a memstore flush is called on a large region (128 MB) 
 the process errors out, killing the regionserver. On restart, replaying the 
 edits generates the same error, making the region unavailable. I tried to 
 manually unassign, assign or close_region. That didn't work because the code 
 that reads/replays it crashes.
 From my investigation this seems to be an overflow issue. The logs show that 
 the function getMinimumMidpointArray tried to access index -32743 of an 
 array, extremely close to the minimum short value in Java. Upon investigation 
 of the source code, it seems a short index is used, incremented for as long 
 as the two vectors are identical, probably making it overflow on large 
 vectors with equal data. Changing it to int should solve the problem.
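
A tiny, self-contained demo (not the HBase code) of that overflow: a {{short}} index walked over two long, identical arrays wraps past 32767 into negative territory, which is exactly the kind of index ({{-32743}}) seen in the stack trace below:

{code:java}
public class ShortIndexOverflowDemo {
  public static void main(String[] args) {
    final byte[] a = new byte[40_000];
    final byte[] b = new byte[40_000];  // identical contents (all zeros)
    short i = 0;
    while (i < a.length && i < b.length && a[i] == b[i]) {
      i++;                              // wraps from 32767 to -32768
      if (i < 0) {
        System.out.println("short index wrapped to " + i);
        break;
      }
    }
  }
}
{code}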
 Here follow the Hadoop logs from when the regionserver went down. Any help is 
 appreciated. If you need any other information, please do tell me:
 2015-03-24 18:00:56,187 INFO  [regionserver//10.2.0.73:16020.logRoller] 
 wal.FSHLog: Rolled WAL 
 /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516
  with entries=143, filesize=134.70 MB; new WAL 
 /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140
 2015-03-24 18:00:56,188 INFO  [regionserver//10.2.0.73:16020.logRoller] 
 wal.FSHLog: Archiving 
 hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709
  to 
 hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709
 2015-03-24 18:04:35,722 INFO  [MemStoreFlusher.0] regionserver.HRegion: 
 Started memstore flush for 
 tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region 
 memstore size 128.04 MB
 2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: 
 ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. 
 Forcing server shutdown
 org.apache.hadoop.hbase.DroppedSnapshotException: region: 
 tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2.
   at 
 org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702)
   at 
 org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445)
   at 
 org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407)
   at 
 org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69)
   at 
 org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743
   at 
 org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478)
   at 
 org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
   at 
 org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
   at 
 org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
   at 
 org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
   at 
 org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
   at 
 

[jira] [Commented] (HBASE-13331) Exceptions from DFS client can cause CatalogJanitor to delete referenced files

2015-03-24 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378826#comment-14378826
 ] 

Benoit Sigoure commented on HBASE-13331:


[How Apache Hadoop is molesting IOException all 
day|http://blog.tsunanet.net/2012/04/apache-hadoop-abuse-ioexception.html]

 Exceptions from DFS client can cause CatalogJanitor to delete referenced files
 --

 Key: HBASE-13331
 URL: https://issues.apache.org/jira/browse/HBASE-13331
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 1.0.0, 0.98.12
Reporter: Elliott Clark
Assignee: Elliott Clark
Priority: Blocker
 Fix For: 2.0.0, 1.0.1, 1.1.0, 0.98.13

 Attachments: HBASE-13331.patch


 CatalogJanitor#checkDaughterInFs assumes that there are no references 
 whenever HRegionFileSystem.openRegionFromFileSystem throws IOException. Well 
 Hadoop and HBase throw IOExceptions whenever someone looks in their general 
 direction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13331) Exceptions from DFS client can cause CatalogJanitor to delete referenced files

2015-03-24 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378831#comment-14378831
 ] 

Benoit Sigoure commented on HBASE-13331:


Ah, see HBASE-5796

 Exceptions from DFS client can cause CatalogJanitor to delete referenced files
 --

 Key: HBASE-13331
 URL: https://issues.apache.org/jira/browse/HBASE-13331
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 1.0.0, 0.98.12
Reporter: Elliott Clark
Assignee: Elliott Clark
Priority: Blocker
 Fix For: 2.0.0, 1.0.1, 1.1.0, 0.98.13

 Attachments: HBASE-13331.patch


 CatalogJanitor#checkDaughterInFs assumes that there are no references 
 whenever HRegionFileSystem.openRegionFromFileSystem throws IOException. Well 
 Hadoop and HBase throw IOExceptions whenever someone looks in their general 
 direction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation

2014-08-08 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-5539:
--

Attachment: 0001-AsyncHBase-PerformanceEvaluation.patch

New patch for the latest of the 0.96 branch.

AFAICT this still hasn't been committed.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: Performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Fix For: 0.95.0

 Attachments: 0001-AsyncHBase-PerformanceEvaluation.patch, 
 0001-AsyncHBase-PerformanceEvaluation.patch, 
 0001-AsyncHBase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt, 
 5539-asynchbase-PerformanceEvaluation-v3.txt, 
 5539-asynchbase-PerformanceEvaluation-v4.txt, 
 5539-asynchbase-PerformanceEvaluation-v5.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11487) ScanResponse carries non-zero cellblock for CloseScanRequest (ScanRequest with close_scanner = true)

2014-07-09 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14056955#comment-14056955
 ] 

Benoit Sigoure commented on HBASE-11487:


Thanks for the patch Shengzhe.

 ScanResponse carries non-zero cellblock for CloseScanRequest (ScanRequest 
 with close_scanner = true)
 

 Key: HBASE-11487
 URL: https://issues.apache.org/jira/browse/HBASE-11487
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC, regionserver
Affects Versions: 0.96.2, 0.99.0, 2.0.0
Reporter: Shengzhe Yao
Assignee: Shengzhe Yao
Priority: Minor
 Fix For: 2.0.0

 Attachments: HBase_11487_v1.patch


 After upgrading HBase from 0.94 to 0.96, we've found that our asynchbase 
 client keeps throwing errors during normal scans. It turns out these errors are 
 due to the Scanner.close call in asynchbase. Since asynchbase assumes that the 
 ScanResponse of a CloseScannerRequest should never carry any cellblocks, it 
 will throw an exception if there is a violation.
 In the asynchbase client (1.5.0), it constructs a CloseScannerRequest in the 
 following way:
{code}
ScanRequest.newBuilder()
  .setScannerId(scanner_id)
  .setCloseScanner(true)
  .build();
{code}
 Note that it does not set numOfRows, which kind of makes sense. Why would a 
 close scanner request care about the number of rows to scan?
 However, after narrowing down the CloseScannerRequest code path, it seems the 
 issue is on the regionserver side. In RsRpcServices.scan, we always initialize 
 numOfRows to 1, and we do this even for a ScanRequest with close_scanner 
 = true. This causes the response for a CloseScannerRequest to carry a cellBlock 
 (if the scan stops before the end row, which can happen in many normal 
 scenarios).
 There are two fixes: either we always set numOfRows on the asynchbase client 
 side when constructing a CloseScannerRequest (see the sketch below), or we fix 
 the default value on the server side.
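
A hedged sketch of the client-side option, based on the request shown above; {{setNumberOfRows}} follows the generated protobuf setter naming, and the intent is simply to ask for zero rows so the server has nothing to attach:

{code:java}
final ScanRequest close = ScanRequest.newBuilder()
    .setScannerId(scanner_id)
    .setNumberOfRows(0)     // explicitly request no rows when closing
    .setCloseScanner(true)
    .build();
{code}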
 From an HBase client's point of view, it makes little sense for the server to 
 send you a cellBlock for your close scanner request unless the request 
 explicitly asks for one. 
 We've made the change in our server code and the asynchbase client errors 
 go away. 
 In addition to this issue, I want to know whether we have any specifications 
 for our HBase RPC, e.g. if close_scanner = true in a ScanRequest and numOfRows 
 is not set, the ScanResponse guarantees that there is no cellBlock in the 
 response. Since we moved to protobuf and many fields are optional for 
 compatibility reasons, it might be helpful to have such a specification to 
 help people develop code that depends on the HBase RPC. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation

2014-01-26 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-5539:
--

Attachment: 0001-AsyncHBase-PerformanceEvaluation.patch

Updated patch for the 0.96 branch, that also builds with JDK 7.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: Performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Fix For: 0.95.0

 Attachments: 0001-AsyncHBase-PerformanceEvaluation.patch, 
 0001-AsyncHBase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt, 
 5539-asynchbase-PerformanceEvaluation-v3.txt, 
 5539-asynchbase-PerformanceEvaluation-v4.txt, 
 5539-asynchbase-PerformanceEvaluation-v5.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10422) ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase

2014-01-26 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-10422:
--

 Summary: ZeroCopyLiteralByteString.zeroCopyGetBytes has an 
unusable prototype and conflicts with AsyncHBase
 Key: HBASE-10422
 URL: https://issues.apache.org/jira/browse/HBASE-10422
 Project: HBase
  Issue Type: Bug
  Components: Client, Protobufs
Affects Versions: 0.96.1.1, 0.96.1, 0.98.0
Reporter: Benoit Sigoure


In HBASE-9868 the trick that AsyncHBase uses to extract byte arrays from 
protobufs without copying them was ported, however the signature of 
{{zeroCopyGetBytes}} was changed for some reason.  There are two problems with 
the changed signature:
# It makes the helper function unusable since it refers to a package-private 
class.
# It clashes with the signature AsyncHBase expects, thereby making life 
miserable for users who pull in both AsyncHBase and HBase on their classpath.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10422) ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase

2014-01-26 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-10422:
---

Attachment: 10422-Fix-the-signature-of-zeroCopyGetBytes-to-make-it-us.patch

Patch to fix the issue.

 ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and 
 conflicts with AsyncHBase
 --

 Key: HBASE-10422
 URL: https://issues.apache.org/jira/browse/HBASE-10422
 Project: HBase
  Issue Type: Bug
  Components: Client, Protobufs
Affects Versions: 0.98.0, 0.96.1, 0.96.1.1
Reporter: Benoit Sigoure
 Attachments: 
 10422-Fix-the-signature-of-zeroCopyGetBytes-to-make-it-us.patch


 In HBASE-9868 the trick that AsyncHBase uses to extract byte arrays from 
 protobufs without copying them was ported, however the signature of 
 {{zeroCopyGetBytes}} was changed for some reason.  There are two problems 
 with the changed signature:
 # It makes the helper function unusable since it refers to a package-private 
 class.
 # It clashes with the signature AsyncHBase expects, thereby making life 
 miserable for users who pull in both AsyncHBase and HBase on their 
 classpath.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HBASE-10422) ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and conflicts with AsyncHBase

2014-01-26 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-10422:
---

Status: Patch Available  (was: Open)

 ZeroCopyLiteralByteString.zeroCopyGetBytes has an unusable prototype and 
 conflicts with AsyncHBase
 --

 Key: HBASE-10422
 URL: https://issues.apache.org/jira/browse/HBASE-10422
 Project: HBase
  Issue Type: Bug
  Components: Client, Protobufs
Affects Versions: 0.96.1.1, 0.96.1, 0.98.0
Reporter: Benoit Sigoure
 Attachments: 
 10422-Fix-the-signature-of-zeroCopyGetBytes-to-make-it-us.patch


 In HBASE-9868 the trick that AsyncHBase uses to extract byte arrays from 
 protobufs without copying them was ported, however the signature of 
 {{zeroCopyGetBytes}} was changed for some reason.  There are two problems 
 with the changed signature:
 # It makes the helper function unusable since it refers to a package-private 
 class.
 # It clashes with the signature AsyncHBase expects, thereby making life 
 miserable for users who pull in both AsyncHBase and HBase on their 
 classpath.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-10119) Allow HBase coprocessors to clean up when they fail

2013-12-10 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-10119:
--

 Summary: Allow HBase coprocessors to clean up when they fail
 Key: HBASE-10119
 URL: https://issues.apache.org/jira/browse/HBASE-10119
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Benoit Sigoure


In the thread [Giving a chance to buggy coprocessors to clean 
up|http://osdir.com/ml/general/2013-12/msg17334.html] I brought up the issue 
that coprocessors currently don't have a chance to release their own resources 
(be they internal resources within the JVM, or external resources elsewhere) 
when they get forcefully removed due to an uncaught exception escaping.

It would be nice to fix that, either by adding an API called by the 
{{CoprocessorHost}} when killing a faulty coprocessor, or by guaranteeing that 
the coprocessor's {{stop()}} method will be invoked then.

This feature request is actually pretty important due to bug HBASE-9046, which 
means that it's not possible to properly clean up a coprocessor without 
restarting the RegionServer (!!).



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HBASE-10119) Allow HBase coprocessors to clean up when they fail

2013-12-10 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-10119:
---

Status: Patch Available  (was: Open)

 Allow HBase coprocessors to clean up when they fail
 ---

 Key: HBASE-10119
 URL: https://issues.apache.org/jira/browse/HBASE-10119
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Benoit Sigoure
 Attachments: HBASE-10119.patch


 In the thread [Giving a chance to buggy coprocessors to clean 
 up|http://osdir.com/ml/general/2013-12/msg17334.html] I brought up the issue 
 that coprocessors currently don't have a chance to release their own 
 resources (be they internal resources within the JVM, or external resources 
 elsewhere) when they get forcefully removed due to an uncaught exception 
 escaping.
 It would be nice to fix that, either by adding an API called by the 
 {{CoprocessorHost}} when killing a faulty coprocessor, or by guaranteeing 
 that the coprocessor's {{stop()}} method will be invoked then.
 This feature request is actually pretty important due to bug HBASE-9046, 
 which means that it's not possible to properly clean up a coprocessor without 
 restarting the RegionServer (!!).



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HBASE-10119) Allow HBase coprocessors to clean up when they fail

2013-12-10 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-10119:
---

Attachment: HBASE-10119.patch

Tentative patch to address the issue by making sure we call the coprocessor's 
{{stop()}} when forcefully removing it.  This is the patch I'm using in 
production right now; it's working well for me.  Sorry I didn't have time to 
write the accompanying test.
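
To make the idea concrete, here is a rough sketch of the shape such a fix can take; the method and field names ({{removeFaultyCoprocessor}}, {{coprocessors}}, {{LOG}}) are illustrative assumptions, not the contents of HBASE-10119.patch:
{code}
// Hypothetical host-side handler: when an uncaught Throwable escapes a
// coprocessor, give it a chance to release its resources before removal.
void removeFaultyCoprocessor(CoprocessorEnvironment env, Throwable cause) {
  try {
    env.getInstance().stop(env);   // best-effort cleanup, may itself throw
  } catch (Throwable t) {
    LOG.warn("stop() of " + env.getInstance() + " also failed", t);
  } finally {
    coprocessors.remove(env);      // forcefully remove it from the host
  }
}
{code}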

 Allow HBase coprocessors to clean up when they fail
 ---

 Key: HBASE-10119
 URL: https://issues.apache.org/jira/browse/HBASE-10119
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.96.0
Reporter: Benoit Sigoure
 Attachments: HBASE-10119.patch


 In the thread [Giving a chance to buggy coprocessors to clean 
 up|http://osdir.com/ml/general/2013-12/msg17334.html] I brought up the issue 
 that coprocessors currently don't have a chance to release their own 
 resources (be they internal resources within the JVM, or external resources 
 elsewhere) when they get forcefully removed due to an uncaught exception 
 escaping.
 It would be nice to fix that, either by adding an API called by the 
 {{CoprocessorHost}} when killing a faulty coprocessor, or by guaranteeing 
 that the coprocessor's {{stop()}} method will be invoked then.
 This feature request is actually pretty important due to bug HBASE-9046, 
 which means that it's not possible to properly clean up a coprocessor without 
 restarting the RegionServer (!!).



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HBASE-9941) The context ClassLoader isn't set while calling into a coprocessor

2013-12-09 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843974#comment-13843974
 ] 

Benoit Sigoure commented on HBASE-9941:
---

In any place.  For instance in {{prePut}} I could do {{new Foo()}} where the 
class {{Foo}} has never been used before, and thus only upon entering 
{{prePut}} would this class get loaded.
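
For reference, a minimal sketch of the kind of fix being asked for, wrapping each coprocessor upcall so the thread's context {{ClassLoader}} is the loader that loaded the coprocessor; {{env.getClassLoader()}}, {{observer}} and {{ctx}} are assumed names, not actual HBase code:
{code}
ClassLoader previous = Thread.currentThread().getContextClassLoader();
try {
  // Libraries that load classes via the thread context ClassLoader will now
  // resolve them against the CoprocessorClassLoader of this coprocessor.
  Thread.currentThread().setContextClassLoader(env.getClassLoader());
  observer.prePut(ctx, put, edit, durability);
} finally {
  Thread.currentThread().setContextClassLoader(previous);
}
{code}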

 The context ClassLoader isn't set while calling into a coprocessor
 --

 Key: HBASE-9941
 URL: https://issues.apache.org/jira/browse/HBASE-9941
 Project: HBase
  Issue Type: Sub-task
  Components: Coprocessors
Affects Versions: 0.96.0
Reporter: Benoit Sigoure
Assignee: Andrew Purtell
 Fix For: 0.98.0


 Whenever one of the methods of a coprocessor is invoked, the context 
 {{ClassLoader}} isn't set to be the {{CoprocessorClassLoader}}.  It's only 
 set properly when calling the coprocessor's {{start}} method.  This means 
 that if the coprocessor code attempts to load classes using the context 
 {{ClassLoader}}, it will fail to find the classes it's looking for.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HBASE-9046) Coprocessors can't be upgraded in service reliably

2013-12-09 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-9046:
--

Summary: Coprocessors can't be upgraded in service reliably  (was: Some 
region servers keep using an older version of coprocessor )

 Coprocessors can't be upgraded in service reliably
 --

 Key: HBASE-9046
 URL: https://issues.apache.org/jira/browse/HBASE-9046
 Project: HBase
  Issue Type: Sub-task
  Components: Coprocessors
Affects Versions: 0.94.8, 0.96.0
 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu 
 Mar 31 21:46:45 PDT 2011 amd64
 java version 1.6.0_07
 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02)
 Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
 hbase: 0.94.8, r1485407
 hadoop: 1.0.4, r1393290
Reporter: iain wright
Priority: Minor
 Fix For: 0.98.0


 My team and another user from the mailing list have run into an issue where 
 replacing the coprocessor jar in HDFS and reloading the table does not load 
 the latest jar. It may load the latest version on some percentage of RS but 
 not all of them.
 This may be a config oversight or a lack of understanding of a caching 
 mechanism that has a purge capability, but I thought I would log it here for 
 confirmation.
 Workaround is to name the coprocessor JAR uniquely, place in HDFS, and 
 re-enable the table using the new jar's name.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (HBASE-10106) Remove some unnecessary code from TestOpenTableInCoprocessor

2013-12-08 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-10106:
--

 Summary: Remove some unnecessary code from 
TestOpenTableInCoprocessor
 Key: HBASE-10106
 URL: https://issues.apache.org/jira/browse/HBASE-10106
 Project: HBase
  Issue Type: Test
Affects Versions: 0.96.0
Reporter: Benoit Sigoure
Priority: Trivial


{code}
diff --git 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
index 7bc2a78..67b97ce 100644
--- 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
+++ 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/TestOpenTableInCoprocessor.java
@@ -69,8 +69,6 @@ public class TestOpenTableInCoprocessor {
 public void prePut(final ObserverContext<RegionCoprocessorEnvironment> e, 
final Put put,
 final WALEdit edit, final Durability durability) throws IOException {
   HTableInterface table = e.getEnvironment().getTable(otherTable);
-  Put p = new Put(new byte[] { 'a' });
-  p.add(family, null, new byte[] { 'a' });
   table.put(put);
   table.flushCommits();
   completed[0] = true;
{code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-10000) Initiate lease recovery for outstanding WAL files at the very beginning of recovery

2013-12-08 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842463#comment-13842463
 ] 

Benoit Sigoure commented on HBASE-10000:


Damn I can't believe I missed issue #10000.  Congrats everyone for filing so 
many bugs!

 Initiate lease recovery for outstanding WAL files at the very beginning of 
 recovery
 ---

 Key: HBASE-10000
 URL: https://issues.apache.org/jira/browse/HBASE-10000
 Project: HBase
  Issue Type: Improvement
Reporter: Ted Yu
Assignee: Ted Yu
 Fix For: 0.98.1

 Attachments: 1-recover-ts-with-pb-2.txt, 
 1-recover-ts-with-pb-3.txt, 1-recover-ts-with-pb-4.txt, 
 1-recover-ts-with-pb-5.txt, 1-v4.txt, 1-v5.txt, 1-v6.txt


 At the beginning of recovery, master can send lease recovery requests 
 concurrently for outstanding WAL files using a thread pool.
 Each split worker would first check whether the WAL file it processes is 
 closed.
 Thanks to Nicolas Liochon and Jeffery, discussion with whom gave rise to this 
 idea.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HBASE-9046) Some region servers keep using an older version of coprocessor

2013-11-10 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-9046:
--

Affects Version/s: 0.96.0

 Some region servers keep using an older version of coprocessor 
 ---

 Key: HBASE-9046
 URL: https://issues.apache.org/jira/browse/HBASE-9046
 Project: HBase
  Issue Type: Bug
  Components: Coprocessors
Affects Versions: 0.94.8, 0.96.0
 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu 
 Mar 31 21:46:45 PDT 2011 amd64
 java version 1.6.0_07
 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02)
 Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
 hbase: 0.94.8, r1485407
 hadoop: 1.0.4, r1393290
Reporter: iain wright
Priority: Minor

 My team and another user from the mailing list have run into an issue where 
 replacing the coprocessor jar in HDFS and reloading the table does not load 
 the latest jar. It may load the latest version on some percentage of RS but 
 not all of them.
 This may be a config oversight or a lack of understanding of a caching 
 mechanism that has a purge capability, but I thought I would log it here for 
 confirmation.
 Workaround is to name the coprocessor JAR uniquely, place in HDFS, and 
 re-enable the table using the new jar's name.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-9046) Some region servers keep using an older version of coprocessor

2013-11-10 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818526#comment-13818526
 ] 

Benoit Sigoure commented on HBASE-9046:
---

I think the problem is that {{CoprocessorClassLoader.classLoadersCache}} 
retains the previous class loader in its cache.  This is a cache that maps the 
path of the .jar file to its corresponding {{CoprocessorClassLoader}}.  The 
values in the cache are weak references, but that doesn't guarantee that they 
will go away in a timely fashion.  Therefore if you edit the schema of your 
table to unset the coprocessor and re-set it, most of the time you will get the 
same {{CoprocessorClassLoader}} as before and the new jar won't be loaded.  I 
can reproduce this trivially and consistently on a single-node non-distributed 
HBase instance.
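
To illustrate the caching behaviour described above, here is a tiny self-contained demo assuming a Guava-style cache with weak values (the real field is {{CoprocessorClassLoader.classLoadersCache}}; the key type and demo class below are made up):
{code}
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.net.URL;
import java.net.URLClassLoader;

public class WeakValueCacheDemo {
  // Same shape as classLoadersCache: jar path (a String here) -> class loader.
  private static final Cache<String, ClassLoader> CACHE =
      CacheBuilder.newBuilder().weakValues().build();

  public static void main(String[] args) {
    CACHE.put("/hbase/lib/coprocessor.jar", new URLClassLoader(new URL[0]));
    // A weakly referenced loader is only dropped once nothing references it
    // AND a GC actually runs, so a lookup shortly after unsetting/re-setting
    // the coprocessor usually still returns the old loader.
    System.out.println(CACHE.getIfPresent("/hbase/lib/coprocessor.jar"));
  }
}
{code}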

 Some region servers keep using an older version of coprocessor 
 ---

 Key: HBASE-9046
 URL: https://issues.apache.org/jira/browse/HBASE-9046
 Project: HBase
  Issue Type: Bug
  Components: Coprocessors
Affects Versions: 0.94.8, 0.96.0
 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu 
 Mar 31 21:46:45 PDT 2011 amd64
 java version 1.6.0_07
 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02)
 Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
 hbase: 0.94.8, r1485407
 hadoop: 1.0.4, r1393290
Reporter: iain wright
Priority: Minor

 My team and another user from the mailing list have run into an issue where 
 replacing the coprocessor jar in HDFS and reloading the table does not load 
 the latest jar. It may load the latest version on some percentage of RS but 
 not all of them.
 This may be a config oversight or a lack of understanding of a caching 
 mechanism that has a purge capability, but I thought I would log it here for 
 confirmation.
 Workaround is to name the coprocessor JAR uniquely, place in HDFS, and 
 re-enable the table using the new jar's name.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-9046) Some region servers keep using an older version of coprocessor

2013-11-10 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818530#comment-13818530
 ] 

Benoit Sigoure commented on HBASE-9046:
---

I can further confirm this because in my current environment I use a single 
coprocessor, so I devised a workaround for this bug: my coprocessor class has a 
{{static int}} I use as a reference count: every time my coprocessor's 
{{start}} is called, I increment it, and in {{stop}} I decrement it.  In 
{{stop}}, when the count drops down to 0, I call 
{{CoprocessorClassLoader.clearCache()}}.  This fixes the problem for me.  This 
trick doesn't work for multiple co-processors, because {{clearCache()}} would 
clear everything.

Also note that {{clearCache()}} is only exposed for testing purposes so it's 
technically not part of the public API.

Another workaround I can think of (but haven't tried) would be to use 
reflection to access the underlying map and clear out the entry.

I think the right way to fix this bug is to maintain the reference count 
manually by doing the increment/decrement from the {{startup()}} and 
{{shutdown()}} methods of {{CoprocessorHost$Environment}}.
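
A minimal sketch of that single-coprocessor workaround, with the caveats already noted ({{clearCache()}} is test-only API and this clears every cached loader); the class itself is just a placeholder:
{code}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.util.CoprocessorClassLoader;

public class MyCoprocessor extends BaseRegionObserver {
  private static final AtomicInteger REF_COUNT = new AtomicInteger();

  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    REF_COUNT.incrementAndGet();
  }

  @Override
  public void stop(CoprocessorEnvironment env) throws IOException {
    if (REF_COUNT.decrementAndGet() == 0) {
      // Drop the cached jar-path -> class-loader entries so the next load
      // reads the (possibly new) jar from scratch.
      CoprocessorClassLoader.clearCache();
    }
  }
}
{code}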

 Some region servers keep using an older version of coprocessor 
 ---

 Key: HBASE-9046
 URL: https://issues.apache.org/jira/browse/HBASE-9046
 Project: HBase
  Issue Type: Bug
  Components: Coprocessors
Affects Versions: 0.94.8, 0.96.0
 Environment: FreeBSD 8.2-RELEASE FreeBSD 8.2-RELEASE #0 r220198: Thu 
 Mar 31 21:46:45 PDT 2011 amd64
 java version 1.6.0_07
 Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02)
 Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
 hbase: 0.94.8, r1485407
 hadoop: 1.0.4, r1393290
Reporter: iain wright
Priority: Minor

 My team and another user from the mailing list have run into an issue where 
 replacing the coprocessor jar in HDFS and reloading the table does not load 
 the latest jar. It may load the latest version on some percentage of RS but 
 not all of them.
 This may be a config oversight or a lack of understanding of a caching 
 mechanism that has a purge capability, but I thought I would log it here for 
 confirmation.
 Workaround is to name the coprocessor JAR uniquely, place in HDFS, and 
 re-enable the table using the new jar's name.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HBASE-9941) The context ClassLoader isn't set while calling into a coprocessor

2013-11-10 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-9941:
-

 Summary: The context ClassLoader isn't set while calling into a 
coprocessor
 Key: HBASE-9941
 URL: https://issues.apache.org/jira/browse/HBASE-9941
 Project: HBase
  Issue Type: Bug
  Components: Coprocessors
Affects Versions: 0.96.0
Reporter: Benoit Sigoure


Whenever one of the methods of a coprocessor is invoked, the context 
{{ClassLoader}} isn't set to be the {{CoprocessorClassLoader}}.  It's only set 
properly when calling the coprocessor's {{start}} method.  This means that if 
the coprocessor code attempts to load classes using the context 
{{ClassLoader}}, it will fail to find the classes it's looking for.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HBASE-9936) Table gets stuck when it fails to open due to a coprocessor error

2013-11-09 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-9936:
-

 Summary: Table gets stuck when it fails to open due to a 
coprocessor error
 Key: HBASE-9936
 URL: https://issues.apache.org/jira/browse/HBASE-9936
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.96.0
Reporter: Benoit Sigoure


I made a mistake after re-enabling a table on which I did an `alter' to 
add a coprocessor: the .jar I specified wasn't a self-contained jar, and thus 
some dependent classes couldn't be found.

{code}
2013-11-09 02:39:05,994 INFO  [AM.ZK.Worker-pool2-t17] master.RegionStates: 
Transitioned {8568640c1da6ce0d5e27b656d28fe9fd state=PENDING_OPEN, 
ts=1383993545988, server=192.168.42.108,59570,1383993435386} to 
{8568640c1da6ce0d5e27b656d28fe9fd state=OPENING, ts=1383993545994, 
server=192.168.42.108,59570,1383993435386}
2013-11-09 02:39:05,995 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
coprocessor.CoprocessorHost: Loading coprocessor class 
com.example.foo.hbase.FooCoprocessor with path 
/Users/tsuna/src/foo/target/scala-2.10/foo_2.10-0.1.jar and priority 1000
2013-11-09 02:39:06,005 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Finding class: com.example.foo.hbase.FooCoprocessor
2013-11-09 02:39:06,006 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Skipping exempt class 
org.apache.hadoop.hbase.coprocessor.BaseRegionObserver - delegating directly to 
parent
2013-11-09 02:39:06,007 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Skipping exempt class java.lang.Object - 
delegating directly to parent
2013-11-09 02:39:06,007 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Finding class: org.slf4j.LoggerFactory
2013-11-09 02:39:06,007 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Class org.slf4j.LoggerFactory not found - 
delegating to parent
2013-11-09 02:39:06,008 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Finding class: 
scala.collection.mutable.StringBuilder
2013-11-09 02:39:06,008 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Class scala.collection.mutable.StringBuilder not 
found - delegating to parent
2013-11-09 02:39:06,008 DEBUG [RS_OPEN_REGION-192.168.42.108:59570-2] 
util.CoprocessorClassLoader: Class scala.collection.mutable.StringBuilder not 
found in parent loader
2013-11-09 02:39:06,008 ERROR [RS_OPEN_REGION-192.168.42.108:59570-2] 
handler.OpenRegionHandler: Failed open of 
region=foo,,1383899959121.8568640c1da6ce0d5e27b656d28fe9fd., starting to roll 
back the global memstore size.
java.lang.IllegalStateException: Could not instantiate a region instance.
at 
org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3820)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4078)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4030)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3981)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:475)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:140)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3817)
... 9 more
Caused by: java.lang.NoClassDefFoundError: 
scala/collection/mutable/StringBuilder
at com.example.foo.hbase.FooCoprocessor.start(FooCoprocessor.scala:18)
at 
org.apache.hadoop.hbase.coprocessor.CoprocessorHost$Environment.startup(CoprocessorHost.java:636)
at 
org.apache.hadoop.hbase.coprocessor.CoprocessorHost.loadInstance(CoprocessorHost.java:259)
at 
org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:212)
at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:192)
at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:154)
at 

[jira] [Created] (HBASE-9879) Can't undelete a KeyValue

2013-11-01 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-9879:
-

 Summary: Can't undelete a KeyValue
 Key: HBASE-9879
 URL: https://issues.apache.org/jira/browse/HBASE-9879
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.96.0
Reporter: Benoit Sigoure


Test scenario:

put(KV, timestamp=100)
put(KV, timestamp=200)
delete(KV, timestamp=200, with MutationProto.DeleteType.DELETE_ONE_VERSION)
get(KV) = returns value at timestamp=100 (OK)
put(KV, timestamp=200)
get(KV) = returns value at timestamp=100 (but not the one at timestamp=200 
that was reborn by the previous put)

Is that normal?

I ran into this bug while running the integration tests at 
https://github.com/OpenTSDB/asynchbase/pull/60 – the first time you run it, it 
passes, but after that, it keeps failing.  Sorry I don't have the corresponding 
HTable-based code but that should be fairly easy to write.

I only tested this with 0.96.0, dunno yet how this behaved in prior releases.

My hunch is that the tombstone added by the DELETE_ONE_VERSION keeps shadowing 
the value even after it's reborn.
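
Since the HTable-based equivalent is mentioned above but not attached, here is a hedged sketch of what it might look like (0.96-era client API; the table, family and qualifier names are made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UndeleteRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t");
    byte[] row = Bytes.toBytes("r"), fam = Bytes.toBytes("f"), qual = Bytes.toBytes("q");

    table.put(new Put(row).add(fam, qual, 100L, Bytes.toBytes("v1")));
    table.put(new Put(row).add(fam, qual, 200L, Bytes.toBytes("v2")));

    Delete d = new Delete(row);
    d.deleteColumn(fam, qual, 200L);              // DELETE_ONE_VERSION of ts=200
    table.delete(d);

    System.out.println(table.get(new Get(row)));  // shows the ts=100 value (OK)

    table.put(new Put(row).add(fam, qual, 200L, Bytes.toBytes("v2")));  // re-put
    System.out.println(table.get(new Get(row)));  // still ts=100: the reborn
                                                  // ts=200 value stays shadowed
    table.close();
  }
}
{code}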



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation

2013-10-29 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807827#comment-13807827
 ] 

Benoit Sigoure commented on HBASE-5539:
---

It doesn't look like this was ever committed.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: Performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Fix For: 0.95.0

 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt, 
 5539-asynchbase-PerformanceEvaluation-v3.txt, 
 5539-asynchbase-PerformanceEvaluation-v4.txt, 
 5539-asynchbase-PerformanceEvaluation-v5.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation

2013-10-29 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-5539:
--

Attachment: 0001-AsyncHBase-PerformanceEvaluation.patch

Patch for 0.96 of the changes that fell through the cracks.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: Performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Fix For: 0.95.0

 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-AsyncHBase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt, 
 5539-asynchbase-PerformanceEvaluation-v3.txt, 
 5539-asynchbase-PerformanceEvaluation-v4.txt, 
 5539-asynchbase-PerformanceEvaluation-v5.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (HBASE-9710) Use the region name, not the encoded name, in exceptions

2013-10-03 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure reassigned HBASE-9710:
-

Assignee: Benoit Sigoure

 Use the region name, not the encoded name, in exceptions
 

 Key: HBASE-9710
 URL: https://issues.apache.org/jira/browse/HBASE-9710
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.95.2, 0.96.0
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
 Attachments: 
 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch


 When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} 
 we put the encoded region name in the exception, which isn't super useful.  I 
 propose putting the region name instead.
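
As a concrete illustration of the proposed change (the {{HRegionInfo}} accessors are real 0.96-era API, but the helper below is illustrative, not the attached patch):
{code}
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.NotServingRegionException;

final class RegionExceptions {
  // Before (roughly): the message only carried the encoded name.
  static NotServingRegionException before(HRegionInfo ri) {
    return new NotServingRegionException(ri.getEncodedName() + " is not online");
  }

  // After: the full region name (table, start key, encoded name) makes the
  // client-side error message actually actionable.
  static NotServingRegionException after(HRegionInfo ri) {
    return new NotServingRegionException(ri.getRegionNameAsString() + " is not online");
  }
}
{code}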



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HBASE-9710) Use the region name, not the encoded name, in exceptions

2013-10-03 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-9710:
-

 Summary: Use the region name, not the encoded name, in exceptions
 Key: HBASE-9710
 URL: https://issues.apache.org/jira/browse/HBASE-9710
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.95.2, 0.96.0
Reporter: Benoit Sigoure
Priority: Minor
 Attachments: 
0001-Log-the-region-name-instead-of-the-encoded-region-na.patch

When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} 
we put the encoded region name in the exception, which isn't super useful.  I 
propose putting the region name instead.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HBASE-9710) Use the region name, not the encoded name, in exceptions

2013-10-03 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-9710:
--

Status: Patch Available  (was: Open)

 Use the region name, not the encoded name, in exceptions
 

 Key: HBASE-9710
 URL: https://issues.apache.org/jira/browse/HBASE-9710
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.95.2, 0.96.0
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
 Attachments: 
 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch


 When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} 
 we put the encoded region name in the exception, which isn't super useful.  I 
 propose putting the region name instead.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HBASE-9710) Use the region name, not the encoded name, in exceptions

2013-10-03 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-9710:
--

Attachment: 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch

Proposed patch.

 Use the region name, not the encoded name, in exceptions
 

 Key: HBASE-9710
 URL: https://issues.apache.org/jira/browse/HBASE-9710
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.95.2, 0.96.0
Reporter: Benoit Sigoure
Priority: Minor
 Attachments: 
 0001-Log-the-region-name-instead-of-the-encoded-region-na.patch


 When we throw a {{RegionOpeningException}} or a {{NotServingRegionException}} 
 we put the encoded region name in the exception, which isn't super useful.  I 
 propose putting the region name instead.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-9612) Ability to batch edits destined to different regions

2013-10-02 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784832#comment-13784832
 ] 

Benoit Sigoure commented on HBASE-9612:
---

Sorry to see the far-reaching consequences this change has.  If we were to 
re-do this from scratch (so assuming that the batch call didn't exist), would 
you have a multi RPC that does only edits (instead of a mix of edits and 
gets), because that would be simpler?

I don't have a strong feeling on mixing edits and gets, but I believe being 
able to batch edits across regions in one RPC call is pretty important.

 Ability to batch edits destined to different regions
 

 Key: HBASE-9612
 URL: https://issues.apache.org/jira/browse/HBASE-9612
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.95.0, 0.95.1, 0.95.2, 0.96.0
Reporter: Benoit Sigoure
Assignee: stack
Priority: Critical
 Fix For: 0.98.0, 0.96.0

 Attachments: 
 0001-fix-packaging-by-region-in-MultiServerCallable.patch, 9612.096.v5.txt, 
 9612revert.txt, 9612v2.txt, 9612v3.txt, 9612v4.txt, 9612v5.txt, 9612v5.txt, 
 9612v5.txt, 9612v7.txt, 9612v8.096.txt, 9612v8.txt, 9612v9.txt, 9612v9.txt, 
 9612.wip.txt


 The old (pre-PB) multi and multiPut RPCs allowed one to batch edits 
 destined to different regions.  Seems like we've lost this ability after the 
 switch to protobufs.
 The {{MultiRequest}} only contains one {{RegionSpecifier}}, and a list of 
 {{MultiAction}}.  The {{MultiAction}} message contains either a single 
 {{MutationProto}} or a {{Get}} (but not both – so its name is misleading as 
 there is nothing multi about it).  Also it seems redundant with 
 {{MultiGetRequest}}, I'm not sure what's the point of supporting {{Get}} in 
 {{MultiAction}}.
 I propose that we change {{MultiRequest}} to be just a list of 
 {{MultiAction}}, and {{MultiAction}} will contain the {{RegionSpecifier}}, 
 the {{bool atomic}} and a list of {{MutationProto}}.  This would be a 
 non-backward compatible protobuf change.
 If we want we can support mixing edits and reads, in which case we'd also add 
 a list of {{Get}} in {{MultiAction}}, and we'd support having both that 
 list and the list of {{MutationProto}} set at the same time.  But this is a 
 bonus and can be done later (in a backward compatible manner, hence no need 
 to rush on this one).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HBASE-9612) Ability to batch edits destined to different regions

2013-09-21 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-9612:
-

 Summary: Ability to batch edits destined to different regions
 Key: HBASE-9612
 URL: https://issues.apache.org/jira/browse/HBASE-9612
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.95.2, 0.95.1, 0.95.0, 0.96.0
Reporter: Benoit Sigoure


The old (pre-PB) multi and multiPut RPCs allowed one to batch edits 
destined to different regions.  Seems like we've lost this ability after the 
switch to protobufs.

The {{MultiRequest}} only contains one {{RegionSpecifier}}, and a list of 
{{MultiAction}}.  The {{MultiAction}} message contains either a single 
{{MutationProto}} or a {{Get}} (but not both – so its name is misleading as 
there is nothing multi about it).  Also it seems redundant with 
{{MultiGetRequest}}, I'm not sure what's the point of supporting {{Get}} in 
{{MultiAction}}.

I propose that we change {{MultiRequest}} to be just a list of 
{{MultiAction}}, and {{MultiAction}} will contain the {{RegionSpecifier}}, the 
{{bool atomic}} and a list of {{MutationProto}}.  This would be a non-backward 
compatible protobuf change.

If we want we can support mixing edits and reads, in which case we'd also add a 
list of {{Get}} in {{MultiAction}}, and we'd support having both that list 
and the list of {{MutationProto}} set at the same time.  But this is a bonus 
and can be done later (in a backward compatible manner, hence no need to rush 
on this one).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8958) Sometimes we refer to the single .META. table region as .META.,,1 and other times as .META.,,1.1028785192

2013-07-25 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719349#comment-13719349
 ] 

Benoit Sigoure commented on HBASE-8958:
---

I noticed that if I refer to META as {{.META.,,1.1028785192}} instead of 
{{.META.,,1}} it doesn't work (the RS sends back an NSRE).

 Sometimes we refer to the single .META. table region as .META.,,1 and other 
 times as .META.,,1.1028785192 
 --

 Key: HBASE-8958
 URL: https://issues.apache.org/jira/browse/HBASE-8958
 Project: HBase
  Issue Type: Bug
Reporter: stack
 Fix For: 0.95.2


 See here how we say in a log:
 {code}
 2013-07-15 22:32:53,805 INFO  [main] regionserver.HRegion(4176): Open 
 {ENCODED => 1028785192, NAME => '.META.,,1', STARTKEY => '', ENDKEY => ''}
 {code}
 but when we open other regions we do:
 {code}
 764 2013-07-15 22:40:10,867 INFO  [RS_OPEN_REGION-durruti:61987-0] 
 regionserver.HRegion: Open {ENCODED => 93dad2bbf6ff5ea0d7477f504b303346, NAME 
  => 'x,,1373953210791.93dad2bbf6ff5ea0d7477f504b303346.', ...
 {code}
 Note how in the second, the name includes the encoded name.
 We'll also do :
 {code}
 2013-07-15 22:32:53,810 INFO  [main] regionserver.HRegion(629): Onlined 
 1028785192/.META.; next sequenceid=1
 {code}
 vs
 {code}
 785 2013-07-15 22:40:10,885 INFO  [AM.ZK.Worker-pool-2-thread-7] 
 master.RegionStates: Onlined 93dad2bbf6ff5ea0d7477f504b303346 on 
 durruti,61987,1373947581222
 {code}
 ... where we print the encoded name.
 Master web UI shows .META.,,1.1028785192
 Benoit originally noticed this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-9006) RPC code requires cellBlockCodecClass even though one is defined by default

2013-07-20 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-9006:
-

 Summary: RPC code requires cellBlockCodecClass even though one is 
defined by default
 Key: HBASE-9006
 URL: https://issues.apache.org/jira/browse/HBASE-9006
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.95.1
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor


The protobuf definition provides a default value:

{code}
// This is sent on connection setup after the connection preamble is sent.
message ConnectionHeader {
  [...]
  optional string cellBlockCodecClass = 3 [default = 
"org.apache.hadoop.hbase.codec.KeyValueCodec"];
  // Compressor we will use if cell block is compressed.  Server will throw 
exception if not supported.
  // Class must implement hadoop's CompressionCodec Interface
  [...]
}
{code}

Yet if one doesn't explicitly set a value, the code was rejecting the 
connection.
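
For what it's worth, a sketch of the shape of such a fix: with proto2 the generated getter already falls back to the declared default, so the server can simply use it instead of requiring the field to be present. The class and method names below reflect the 0.95-era layout and are assumptions, not the attached patch:
{code}
import org.apache.hadoop.hbase.codec.Codec;
import org.apache.hadoop.hbase.protobuf.generated.RPCProtos.ConnectionHeader;

final class CellBlockCodecResolver {
  /** proto2 getters return the declared default when the field is unset. */
  static Codec codecFor(ConnectionHeader header) throws Exception {
    String codecClass = header.getCellBlockCodecClass();
    // codecClass is "org.apache.hadoop.hbase.codec.KeyValueCodec" if the
    // client sent nothing, so there is no reason to reject the connection.
    return codecClass.isEmpty() ? null
        : (Codec) Class.forName(codecClass).newInstance();
  }
}
{code}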

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-9006) RPC code requires cellBlockCodecClass even though one is defined by default

2013-07-20 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-9006:
--

Attachment: 0001-Don-t-require-a-cellBlockCodecClass-in-the-RPC-heade.patch

Patch that fixes the issue.

 RPC code requires cellBlockCodecClass even though one is defined by default
 ---

 Key: HBASE-9006
 URL: https://issues.apache.org/jira/browse/HBASE-9006
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.95.1
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
 Attachments: 
 0001-Don-t-require-a-cellBlockCodecClass-in-the-RPC-heade.patch


 The protobuf definition provides a default value:
 {code}
 // This is sent on connection setup after the connection preamble is sent.
 message ConnectionHeader {
   [...]
   optional string cellBlockCodecClass = 3 [default = 
 "org.apache.hadoop.hbase.codec.KeyValueCodec"];
   // Compressor we will use if cell block is compressed.  Server will throw 
 exception if not supported.
   // Class must implement hadoop's CompressionCodec Interface
   [...]
 }
 {code}
 Yet if one doesn't explicitly set a value, the code was rejecting the 
 connection.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-9006) RPC code requires cellBlockCodecClass even though one is defined by default

2013-07-20 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-9006:
--

Status: Patch Available  (was: Open)

 RPC code requires cellBlockCodecClass even though one is defined by default
 ---

 Key: HBASE-9006
 URL: https://issues.apache.org/jira/browse/HBASE-9006
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.95.1
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
 Attachments: 
 0001-Don-t-require-a-cellBlockCodecClass-in-the-RPC-heade.patch


 The protobuf definition provides a default value:
 {code}
 // This is sent on connection setup after the connection preamble is sent.
 message ConnectionHeader {
   [...]
   optional string cellBlockCodecClass = 3 [default = 
 "org.apache.hadoop.hbase.codec.KeyValueCodec"];
   // Compressor we will use if cell block is compressed.  Server will throw 
 exception if not supported.
   // Class must implement hadoop's CompressionCodec Interface
   [...]
 }
 {code}
 Yet if one doesn't explicitly set a value, the code was rejecting the 
 connection.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9001) TestThriftServerCmdLine.testRunThriftServer[0] failed

2013-07-20 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714369#comment-13714369
 ] 

Benoit Sigoure commented on HBASE-9001:
---

IT'S OVER 9000!!!

 TestThriftServerCmdLine.testRunThriftServer[0] failed
 -

 Key: HBASE-9001
 URL: https://issues.apache.org/jira/browse/HBASE-9001
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: stack
 Fix For: 0.95.2

 Attachments: 9001.txt


 https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/624/testReport/junit/org.apache.hadoop.hbase.thrift/TestThriftServerCmdLine/testRunThriftServer_0_/
 It seems stuck here:
 {code}
 2013-07-19 03:52:03,158 INFO  [Thread-131] 
 thrift.TestThriftServerCmdLine(132): Starting HBase Thrift server with 
 command line: -hsha -port 56708 start
 2013-07-19 03:52:03,174 INFO  [ThriftServer-cmdline] 
 thrift.ThriftServerRunner$ImplType(208): Using thrift server type hsha
 2013-07-19 03:52:03,205 WARN  [ThriftServer-cmdline] conf.Configuration(817): 
 fs.default.name is deprecated. Instead, use fs.defaultFS
 2013-07-19 03:52:03,206 WARN  [ThriftServer-cmdline] conf.Configuration(817): 
 mapreduce.job.counters.limit is deprecated. Instead, use 
 mapreduce.job.counters.max
 2013-07-19 03:52:03,207 WARN  [ThriftServer-cmdline] conf.Configuration(817): 
 io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
 2013-07-19 03:54:03,156 INFO  [pool-1-thread-1] hbase.ResourceChecker(171): 
 after: thrift.TestThriftServerCmdLine#testRunThriftServer[0] Thread=146 (was 
 155), OpenFileDescriptor=295 (was 311), MaxFileDescriptor=4096 (was 4096), 
 SystemLoadAverage=293 (was 240) - SystemLoadAverage LEAK? -, ProcessCount=145 
 (was 143) - ProcessCount LEAK? -, AvailableMemoryMB=779 (was 1263), 
 ConnectionCount=4 (was 4)
 2013-07-19 03:54:03,157 DEBUG [pool-1-thread-1] 
 thrift.TestThriftServerCmdLine(107): implType=-hsha, specifyFramed=false, 
 specifyBindIP=false, specifyCompact=true
 {code}
 My guess is that we didn't get scheduled because load was almost 300 on this 
 box at the time?
 Let me up the timeout of two minutes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever

2013-07-15 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-8952:
--

Attachment: HBASE-8952.patch

Patch to fix the issue in the 0.95 branch.

 Missing error handling can cause RegionServer RPC thread to busy loop forever
 -

 Key: HBASE-8952
 URL: https://issues.apache.org/jira/browse/HBASE-8952
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
 Attachments: HBASE-8952.patch


 This bug seems to be present in all released versions of HBase, including the 
 tip of the 0.94 and 0.95 branches.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever

2013-07-15 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-8952:
-

 Summary: Missing error handling can cause RegionServer RPC thread 
to busy loop forever
 Key: HBASE-8952
 URL: https://issues.apache.org/jira/browse/HBASE-8952
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
 Attachments: HBASE-8952.patch

This bug seems to be present in all released versions of HBase, including the 
tip of the 0.94 and 0.95 branches.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever

2013-07-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708874#comment-13708874
 ] 

Benoit Sigoure commented on HBASE-8952:
---

I attached a patch that fixes the issue for me in 0.95 – the patch would need 
to be ported to other branches as well as to the secure RPC implementation, 
which is separate in 0.94 ({{ipc/SecureServer.java}}).

 Missing error handling can cause RegionServer RPC thread to busy loop forever
 -

 Key: HBASE-8952
 URL: https://issues.apache.org/jira/browse/HBASE-8952
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
 Attachments: HBASE-8952.patch


 This bug seems to be present in all released versions of HBase, including the 
 tip of the 0.94 and 0.95 branches.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever

2013-07-15 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-8952:
--

Status: Patch Available  (was: Open)

 Missing error handling can cause RegionServer RPC thread to busy loop forever
 -

 Key: HBASE-8952
 URL: https://issues.apache.org/jira/browse/HBASE-8952
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
 Attachments: HBASE-8952.patch


 This bug seems to be present in all released versions of HBase, including the 
 tip of the 0.94 and 0.95 branches.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8952) Missing error handling can cause RegionServer RPC thread to busy loop forever

2013-07-15 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-8952:
--

Description: 
If the connection to the client is closed unexpectedly and at the wrong time, 
the code will attempt to keep reading from the socket in a busy loop.

This bug seems to be present in all released versions of HBase, including the 
tip of the 0.94 and 0.95 branches; however, I only ran into it while porting 
AsyncHBase to 0.95.

  was:This bug seems to be present in all released versions of HBase, including 
the tip of the 0.94 and 0.95 branches.
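
As a generic illustration of the failure mode (not HBase's actual RPC reader code), the missing piece is handling the -1 return value that signals the peer closed the connection:
{code}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class ReadHelper {
  /** Reads once; EOF must be surfaced, otherwise the caller loops forever. */
  static int readOnce(SocketChannel channel, ByteBuffer buffer) throws IOException {
    int n = channel.read(buffer);
    if (n < 0) {
      // -1 means the peer closed the connection; silently treating it like
      // "nothing available yet" is exactly the busy loop described here.
      throw new EOFException("peer closed connection");
    }
    return n;  // 0 simply means nothing is available right now
  }
}
{code}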


 Missing error handling can cause RegionServer RPC thread to busy loop forever
 -

 Key: HBASE-8952
 URL: https://issues.apache.org/jira/browse/HBASE-8952
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
 Attachments: HBASE-8952.patch


 If the connection to the client is closed unexpectedly and at the wrong 
 time, the code will attempt to keep reading from the socket in a busy loop.
 This bug seems to be present in all released versions of HBase, including the 
 tip of the 0.94 and 0.95 branches; however, I only ran into it while porting 
 AsyncHBase to 0.95.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6290) Add a function to mark a server as dead and start the recovery process

2013-04-25 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641541#comment-13641541
 ] 

Benoit Sigoure commented on HBASE-6290:
---

Yes, that's right.  It would be great to have this sort of kill switch, both at 
the HBase level and at the HDFS level.  The feature I presented works especially 
well to tell all interested parties (clients) that the node they're trying to 
reach is dead, but often it doesn't help time the node out of the cluster; 
e.g. in HDFS or MapReduce, the NameNode and JobTracker will ignore TCP resets 
and will not flag the node as really dead until some long pre-configured 
timeout elapses.

 Add a function to mark a server as dead and start the recovery process
 -

 Key: HBASE-6290
 URL: https://issues.apache.org/jira/browse/HBASE-6290
 Project: HBase
  Issue Type: Improvement
  Components: monitoring
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Minor
  Labels: noob

 ZooKeeper is used as a monitoring tool: we use znodes and we start the recovery 
 process when a znode is deleted by ZK because it timed out. This timeout 
 is defaulted to 90 seconds, and often set to 30s.
 However, some HW issues could be detected by specialized hw monitoring tools 
 before the ZK timeout. For this reason, it makes sense to offer a very simple 
 function to mark an RS as dead. This should not take in
 It could be an hbase shell function such as
 considerAsDead <ipAddress|serverName>
 This would delete all the znodes of the server running on this box, starting 
 the recovery process.
 Such a function would be easily callable (at the caller's risk) by any fault 
 detection tool... We could have issues identifying the right master & region 
 servers around ipv4 vs ipv6 and multi-networked boxes, however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6504) Adding GC details prevents HBase from starting in non-distributed mode

2012-09-14 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456124#comment-13456124
 ] 

Benoit Sigoure commented on HBASE-6504:
---

Yeah.  One minor nit though: the form {{head -1}} is deprecated (and has been 
for years).  Better to use {{head -n 1}}.

 Adding GC details prevents HBase from starting in non-distributed mode
 --

 Key: HBASE-6504
 URL: https://issues.apache.org/jira/browse/HBASE-6504
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.0
Reporter: Benoit Sigoure
Assignee: Michael Drzal
Priority: Trivial
  Labels: noob
 Attachments: HBASE-6504-output.txt, HBASE-6504.patch


 The {{conf/hbase-env.sh}} that ships with HBase contains a few commented out 
 examples of variables that could be useful, such as adding 
 {{-XX:+PrintGCDetails -XX:+PrintGCDateStamps}} to {{HBASE_OPTS}}.  This has 
 the annoying side effect that the JVM prints a summary of memory usage when 
 it exits, and it does so on stdout:
 {code}
 $ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool 
 hbase.cluster.distributed
 false
 Heap
  par new generation   total 19136K, used 4908K [0x00073a20, 
 0x00073b6c, 0x00075186)
   eden space 17024K,  28% used [0x00073a20, 0x00073a6cb0a8, 
 0x00073b2a)
   from space 2112K,   0% used [0x00073b2a, 0x00073b2a, 
 0x00073b4b)
   to   space 2112K,   0% used [0x00073b4b, 0x00073b4b, 
 0x00073b6c)
  concurrent mark-sweep generation total 63872K, used 0K [0x00075186, 
 0x0007556c, 0x0007f5a0)
  concurrent-mark-sweep perm gen total 21248K, used 6994K [0x0007f5a0, 
 0x0007f6ec, 0x0008)
 $ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool 
 hbase.cluster.distributed >/dev/null
 (nothing printed)
 {code}
 And this confuses {{bin/start-hbase.sh}} when it does
 {{distMode=`$bin/hbase --config $HBASE_CONF_DIR 
 org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed`}}, 
 because then the {{distMode}} variable is not just set to {{false}}, it also 
 contains all this JVM spam.
 If you don't pay enough attention and realize that 3 processes are getting 
 started (ZK, HM, RS) instead of just one (HM), then you end up with this 
 confusing error message:
 {{Could not start ZK at requested port of 2181.  ZK was started at port: 
 2182.  Aborting as clients (e.g. shell) will not be able to find this ZK 
 quorum.}}, which is even more puzzling because when you run {{netstat}} to 
 see who owns that port, then you won't find any rogue process other than the 
 one you just started.
 I'm wondering if the fix is not to just change the {{if [ $distMode == 
 'false' ]}} to a {{switch $distMode case (false*)}} type of test, to work 
 around this annoying JVM misfeature that pollutes stdout.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6586) Quarantine Corrupted HFiles

2012-08-23 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440930#comment-13440930
 ] 

Benoit Sigoure commented on HBASE-6586:
---

Not sure why {{HBaseIOException}} would be added in a JIRA about Quarantine 
Corrupted HFiles, but yes that would be nice to have.

 Quarantine Corrupted HFiles
 ---

 Key: HBASE-6586
 URL: https://issues.apache.org/jira/browse/HBASE-6586
 Project: HBase
  Issue Type: Improvement
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh
 Attachments: 0001-hbase-6568-hbck-quarantine-v6.patch, 
 hbase-6586-92-v3.patch, hbase-6586-92-v8.patch, hbase-6586-94-v3.patch, 
 hbase-6586-94-v8.patch, hbase-6586.patch, hbase-6586-trunk-v3.patch, 
 hbase-6586-trunk-v4.patch, hbase-6586-trunk-v5.patch, 
 hbase-6586-trunk-v6.patch, hbase-6586-trunk-v7.patch, 
 hbase-6586-trunk-v8.patch


 We've encountered a few upgrades from 0.90 hbases + 20.2/1.x hdfs to 0.92 
 hbases + hdfs 2.x that get stuck.  I haven't been able to duplicate the 
 problem in my dev environment but we suspect this may be related to 
 HDFS-3731.  On the HBase side, it seems reasonable to quarantine what are 
 most likely truncated hfiles, so that they could later be recovered.
 Here's an example of the exception we've encountered:
 {code}
 2012-07-18 05:55:01,152 ERROR handler.OpenRegionHandler 
 (OpenRegionHandler.java:openRegion(346)) - Failed open of 
 region=user_mappings,080112102AA76EF98197605D341B9E6C5824D2BC|1001,1317824890618.eaed0e7abc6d27d28ff0e5a9b49c4c
 0d. 
 java.io.IOException: java.lang.IllegalArgumentException: Invalid HFile 
 version: 842220600 (expected to be between 1 and 2) 
 at 
 org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:306)
  
 at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:371) 
 at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:387) 
 at 
 org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1026)
  
 at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:485) 
 at 
 org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:566)
  
 at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:286) 
 at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:223) 
 at 
 org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2534)
  
 at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:454) 
 at 
 org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3282) 
 at 
 org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3230) 
 at 
 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:331)
 at 
 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:107)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169) 
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  
 at java.lang.Thread.run(Thread.java:619) 
 Caused by: java.lang.IllegalArgumentException: Invalid HFile version: 
 842220600 (expected to be between 1 and 2) 
 at org.apache.hadoop.hbase.io.hfile.HFile.checkFormatVersion(HFile.java:515) 
 at 
 org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:303)
  
 ... 17 more
 {code}
 Specifically -- the FixedFileTrailer are incorrect, and seemingly missing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6504) Adding GC details prevents HBase from starting in non-distributed mode

2012-08-02 Thread Benoit Sigoure (JIRA)
Benoit Sigoure created HBASE-6504:
-

 Summary: Adding GC details prevents HBase from starting in 
non-distributed mode
 Key: HBASE-6504
 URL: https://issues.apache.org/jira/browse/HBASE-6504
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.0
Reporter: Benoit Sigoure
Priority: Trivial


The {{conf/hbase-env.sh}} that ships with HBase contains a few commented out 
examples of variables that could be useful, such as adding 
{{-XX:+PrintGCDetails -XX:+PrintGCDateStamps}} to {{HBASE_OPTS}}.  This has the 
annoying side effect that the JVM prints a summary of memory usage when it 
exits, and it does so on stdout:

{code}
$ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed
false
Heap
 par new generation   total 19136K, used 4908K [0x00073a20, 
0x00073b6c, 0x00075186)
  eden space 17024K,  28% used [0x00073a20, 0x00073a6cb0a8, 
0x00073b2a)
  from space 2112K,   0% used [0x00073b2a, 0x00073b2a, 
0x00073b4b)
  to   space 2112K,   0% used [0x00073b4b, 0x00073b4b, 
0x00073b6c)
 concurrent mark-sweep generation total 63872K, used 0K [0x00075186, 
0x0007556c, 0x0007f5a0)
 concurrent-mark-sweep perm gen total 21248K, used 6994K [0x0007f5a0, 
0x0007f6ec, 0x0008)
$ ./bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed >/dev/null
(nothing printed)
{code}

And this confuses {{bin/start-hbase.sh}} when it does
{{distMode=`$bin/hbase --config $HBASE_CONF_DIR 
org.apache.hadoop.hbase.util.HBaseConfTool hbase.cluster.distributed`}}, 
because then the {{distMode}} variable is not just set to {{false}}, it also 
contains all this JVM spam.

If you don't pay enough attention to realize that 3 processes are getting 
started (ZK, HM, RS) instead of just one (HM), you end up with this 
confusing error message:
{{Could not start ZK at requested port of 2181.  ZK was started at port: 2182.  
Aborting as clients (e.g. shell) will not be able to find this ZK quorum.}}, 
which is even more puzzling because when you run {{netstat}} to see who owns 
that port, you won't find any rogue process other than the one you just 
started.

I'm wondering whether the fix is simply to change the {{if [ $distMode == 
'false' ]}} test to a {{case $distMode in (false*)}} style of test, to work 
around this annoying JVM misfeature that pollutes stdout.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-2877) Unnecessary byte written when serializing a Writable RPC parameter

2012-07-19 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-2877:
--

Affects Version/s: 0.90.0
   0.90.1
   0.90.2
   0.90.3
   0.90.4
   0.90.5
   0.90.6
   0.92.0
   0.92.1
   0.94.0

 Unnecessary byte written when serializing a Writable RPC parameter
 --

 Key: HBASE-2877
 URL: https://issues.apache.org/jira/browse/HBASE-2877
 Project: HBase
  Issue Type: Bug
  Components: ipc
Affects Versions: 0.20.5, 0.89.20100621, 0.90.0, 0.90.1, 0.90.2, 0.90.3, 
 0.90.4, 0.90.5, 0.90.6, 0.92.0, 0.92.1, 0.94.0
Reporter: Benoit Sigoure
Priority: Minor

 When {{HbaseObjectWritable#writeObject}} serializes a {{Writable}} RPC 
 parameter, it writes its class code twice to the wire.  {{writeClassCode}} 
 is already called once unconditionally at the beginning of the method, and 
 for {{Writable}} arguments, it's called a second time towards the end of the 
 method.  It seems that the code is trying to deal with the declared type 
 vs. actual type of a parameter.  The Hadoop RPC code was already doing this 
 before Stack changed it to use codes in r608738 for HADOOP-2519.  It's not 
 documented when this is useful though, and I couldn't find any use case.  
 Every RPC I've seen so far just ends up with the same byte sent twice to the 
 wire.
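As a toy reduction of the duplicate write (not HBase's actual code table or methods; the class code value is invented), the effect on the wire looks like this:
{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class DoubleClassCodeDemo {
  // Pretend class code for a hypothetical Writable parameter type.
  static final byte FAKE_CODE_FOR_GET = 42;

  static byte[] serialize() throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    // 1) Code written unconditionally at the top of the writeObject-style method.
    out.writeByte(FAKE_CODE_FOR_GET);
    // 2) The Writable branch writes the code of the *actual* class again
    //    before the payload; when declared and actual types match, this is
    //    the same byte repeated.
    out.writeByte(FAKE_CODE_FOR_GET);
    // ... followed by instance.write(out) in the real code.
    out.flush();
    return buf.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    System.out.println("bytes on the wire: " + serialize().length); // 2, one redundant
  }
}
{code}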

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6239) [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row

2012-07-12 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412564#comment-13412564
 ] 

Benoit Sigoure commented on HBASE-6239:
---

This means HBase replication will still corrupt timestamps in 0.90.7, which in 
many cases makes replication useless.  Are you sure?

 [replication] ReplicationSink uses the ts of the first KV for the other KVs 
 in the same row
 ---

 Key: HBASE-6239
 URL: https://issues.apache.org/jira/browse/HBASE-6239
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6, 0.92.1
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Critical
  Labels: corruption
 Fix For: 0.92.2, 0.90.8

 Attachments: HBASE-6239-0.92-v1.patch


 ReplicationSink assumes that all the KVs for the same row inside a WALEdit 
 will have the same timestamp, which is not necessarily the case.
 This only affects 0.90 and 0.92 since HBASE-5203 fixes it in 0.94
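A minimal sketch of the per-cell-timestamp behaviour being asked for, using the 0.90/0.92-era client API; the helper itself is hypothetical, not the patch attached here:
{code}
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;

public class ReplicationSinkSketch {
  /** Build a Put from replicated KVs, preserving each KV's own timestamp. */
  static Put toPut(byte[] row, List<KeyValue> kvs) {
    Put put = new Put(row);
    for (KeyValue kv : kvs) {
      // Old API: add(family, qualifier, ts, value). Using kv.getTimestamp()
      // per cell is what avoids stamping every cell with the first KV's ts.
      put.add(kv.getFamily(), kv.getQualifier(), kv.getTimestamp(), kv.getValue());
    }
    return put;
  }
}
{code}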

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6239) [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row

2012-06-22 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399417#comment-13399417
 ] 

Benoit Sigoure commented on HBASE-6239:
---

I would argue that this bug is not minor, because we're talking about data 
being corrupted by HBase.

 [replication] ReplicationSink uses the ts of the first KV for the other KVs 
 in the same row
 ---

 Key: HBASE-6239
 URL: https://issues.apache.org/jira/browse/HBASE-6239
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6, 0.92.1
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Minor
  Labels: corruption
 Fix For: 0.92.2

 Attachments: HBASE-6239-0.92-v1.patch


 ReplicationSink assumes that all the KVs for the same row inside a WALEdit 
 will have the same timestamp, which is not necessarily the case.
 This only affects 0.90 and 0.92 since HBASE-5203 fixes it in 0.94

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation

2012-06-21 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398880#comment-13398880
 ] 

Benoit Sigoure commented on HBASE-5539:
---

+1, thanks.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.
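For readers unfamiliar with asynchbase, a rough sketch of what a write loop against its client API looks like (illustrative only, not the attached patch; the table, family and row values are made up):
{code}
import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;

public class AsynchbaseWriteSketch {
  public static void main(String[] args) throws Exception {
    final HBaseClient client = new HBaseClient("localhost");  // ZK quorum spec
    final byte[] table = "TestTable".getBytes();
    final byte[] family = "info".getBytes();
    final byte[] qualifier = "data".getBytes();
    try {
      for (int i = 0; i < 1000; i++) {
        byte[] row = ("row" + i).getBytes();
        byte[] value = new byte[1024];
        // put() is asynchronous and returns a Deferred; a real benchmark
        // would track the Deferreds and respect some in-flight limit.
        client.put(new PutRequest(table, row, family, qualifier, value));
      }
      // Make sure everything buffered client-side actually reaches HBase.
      client.flush().joinUninterruptibly();
    } finally {
      client.shutdown().joinUninterruptibly();
    }
  }
}
{code}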

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation

2012-06-21 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398982#comment-13398982
 ] 

Benoit Sigoure commented on HBASE-5539:
---

Yeah someone first needs to commit HBASE-5539.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt, 
 5539-asynchbase-PerformanceEvaluation-v3.txt, 
 5539-asynchbase-PerformanceEvaluation-v4.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation

2012-06-21 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399011#comment-13399011
 ] 

Benoit Sigoure commented on HBASE-5539:
---

Err...  I meant this issue depends on HBASE-5527 to be committed first.  That's 
why you were missing the method {{getNumClientThreads()}} earlier.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt, 
 5539-asynchbase-PerformanceEvaluation-v3.txt, 
 5539-asynchbase-PerformanceEvaluation-v4.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation

2012-06-21 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399112#comment-13399112
 ] 

Benoit Sigoure commented on HBASE-5539:
---

Thanks for the commit.  In the future please keep the changes in separate 
commits if they are in separate issues.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch, 
 5539-asynchbase-PerformanceEvaluation-v2.txt, 
 5539-asynchbase-PerformanceEvaluation-v3.txt, 
 5539-asynchbase-PerformanceEvaluation-v4.txt, 
 5539-asynchbase-PerformanceEvaluation-v5.txt


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5527) PerformanceEvaluation: Report aggregate timings on a single line

2012-06-21 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399113#comment-13399113
 ] 

Benoit Sigoure commented on HBASE-5527:
---

Using {{nanoTime}} isn't only nice, it's also more correct :)

{{currentTimeMillis}} depends on system time and is not monotonic, whereas 
{{nanoTime}} is almost always implemented with a proper monotonic clock 
(although I think technically this isn't _guaranteed_, in practice it is the 
case on all reasonable platforms).

 PerformanceEvaluation: Report aggregate timings on a single line
 

 Key: HBASE-5527
 URL: https://issues.apache.org/jira/browse/HBASE-5527
 Project: HBase
  Issue Type: Improvement
  Components: performance
Affects Versions: 0.92.0
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
 Attachments: 0001-PerformanceEvaluation-fixes.patch, 
 0001-PerformanceEvaluation-fixes.patch


 When running {{PerformanceEvaluation}} with {{--nomapred}} it's hard to 
 locate all the lines saying {{Finished 14 in 292979ms writing 200 rows}} 
 in the output.  This change adds a couples line to summarize the run at the 
 end, which makes parsing and scripting the output easier:
 {code}
 12/03/06 00:43:58 INFO hbase.PerformanceEvaluation: [RandomWriteTest] Summary 
 of timings (ms): [15940, 15776, 15866, 15973, 15682, 15740, 15764, 15830, 
 15768, 15968, 15921, 15755, 15963, 15818, 15903, 15662]
 12/03/06 00:43:58 INFO hbase.PerformanceEvaluation: [RandomWriteTest] Min: 15662ms  Max: 15973ms  Avg: 15833ms
 {code}
 Patch also removes a couple minor code smells.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6239) [replication] ReplicationSink uses the ts of the first KV for the other KVs in the same row

2012-06-19 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-6239:
--

Labels: corruption  (was: )

 [replication] ReplicationSink uses the ts of the first KV for the other KVs 
 in the same row
 ---

 Key: HBASE-6239
 URL: https://issues.apache.org/jira/browse/HBASE-6239
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6, 0.92.1
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Minor
  Labels: corruption
 Fix For: 0.92.2

 Attachments: HBASE-6239-0.92-v1.patch


 ReplicationSink assumes that all the KVs for the same row inside a WALEdit 
 will have the same timestamp, which is not necessarily the case.
 This only affects 0.90 and 0.92 since HBASE-5203 fixes it in 0.94

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-3170) RegionServer confused about empty row keys

2012-06-16 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-3170:
--

Affects Version/s: 0.90.0
   0.90.1
   0.90.2
   0.90.3
   0.90.4
   0.90.5
   0.90.6
   0.92.0
   0.92.1

 RegionServer confused about empty row keys
 --

 Key: HBASE-3170
 URL: https://issues.apache.org/jira/browse/HBASE-3170
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.89.20100621, 0.89.20100924, 0.90.0, 0.90.1, 0.90.2, 
 0.90.3, 0.90.4, 0.90.5, 0.90.6, 0.92.0, 0.92.1
Reporter: Benoit Sigoure

 I'm no longer sure about the expected behavior when using an empty row key 
 (e.g. a 0-byte long byte array).  I assumed that this was a legitimate row 
 key, just like having an empty column qualifier is allowed.  But it seems 
 that the RegionServer considers the empty row key to be whatever the first 
 row key is.
 {code}
 Version: 0.89.20100830, r0da2890b242584a8a5648d83532742ca7243346b, Sat Sep 18 
 15:30:09 PDT 2010
 hbase(main):001:0> scan 'tsdb-uid', {LIMIT => 1}
 ROW   COLUMN+CELL 
  
  \x00 column=id:metrics, timestamp=1288375187699, 
 value=foo  
  \x00 column=id:tagk, timestamp=1287522021046, 
 value=bar 
  \x00 column=id:tagv, timestamp=1288111387685, 
 value=qux  
 1 row(s) in 0.4610 seconds
 hbase(main):002:0> get 'tsdb-uid', ''
 COLUMNCELL
  
  id:metrics   timestamp=1288375187699, value=foo  

  id:tagk  timestamp=1287522021046, value=bar  

  id:tagv  timestamp=1288111387685, value=qux  
 
 3 row(s) in 0.0910 seconds
 hbase(main):003:0> get 'tsdb-uid', "\000"
 COLUMNCELL
  
  id:metrics   timestamp=1288375187699, value=foo  

  id:tagk  timestamp=1287522021046, value=bar  

  id:tagv  timestamp=1288111387685, value=qux  
 
 3 row(s) in 0.0550 seconds
 {code}
 This isn't a parsing problem with the command-line of the shell.  I can 
 reproduce this behavior both with plain Java code and with my asynchbase 
 client.
 Since I don't actually have a row with an empty row key, I expected that the 
 first {{get}} would return nothing.
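The same check expressed against the era's {{HTable}} API, as an illustrative sketch (not a test from this issue; whether the client even accepts a zero-length row key may differ in later versions):
{code}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class EmptyRowKeyGet {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "tsdb-uid");
    try {
      // Empty row key: expected to match nothing, but the RegionServer
      // answers as if it were the first row of the region.
      Result empty = table.get(new Get(new byte[0]));
      // Row key that actually exists: a single 0x00 byte.
      Result zero = table.get(new Get(new byte[] { 0 }));
      System.out.println("empty key -> " + empty.size() + " KVs");
      System.out.println("\\x00 key  -> " + zero.size() + " KVs");
    } finally {
      table.close();
    }
  }
}
{code}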

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation

2012-06-12 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-5539:
--

Attachment: 0001-asynchbase-PerformanceEvaluation.patch

Updated patch with new {{pom.xml}}.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5539) asynchbase PerformanceEvaluation

2012-06-12 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-5539:
--

Status: Patch Available  (was: Open)

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Attachments: 0001-asynchbase-PerformanceEvaluation.patch, 
 0001-asynchbase-PerformanceEvaluation.patch


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5539) asynchbase PerformanceEvaluation

2012-05-15 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13275887#comment-13275887
 ] 

Benoit Sigoure commented on HBASE-5539:
---

Yeah I have preliminary results at http://goo.gl/mZAcK – it shows that 
asynchbase can be quite a bit faster than {{HTable}}, surprisingly perhaps 
especially for read-heavy workloads, as well as for write-heavy workloads with 
many threads, where {{HTable}} suffers from really poor concurrency.

BTW, asynchbase 1.3.0 has been released, so the patch I attached originally to 
this issue needs to be updated to change the dependency to be on that version.  
I'll post a new patch soon, unless someone beats me to it.

 asynchbase PerformanceEvaluation
 

 Key: HBASE-5539
 URL: https://issues.apache.org/jira/browse/HBASE-5539
 Project: HBase
  Issue Type: New Feature
  Components: performance
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
  Labels: benchmark
 Attachments: 0001-asynchbase-PerformanceEvaluation.patch


 I plugged [asynchbase|https://github.com/stumbleupon/asynchbase] into 
 {{PerformanceEvaluation}}.  This enables testing asynchbase from 
 {{PerformanceEvaluation}} and comparing its performance to {{HTable}}.  Also 
 asynchbase doesn't come with any benchmark, so it was good that I was able to 
 plug it into {{PerformanceEvaluation}} relatively easily.
 I am in the process of collecting results on a dev cluster running 0.92.1 
 and will publish them once they're ready.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-2489) Make the Filesystem needs to be upgraded error message more useful.

2012-04-22 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259035#comment-13259035
 ] 

Benoit Sigoure commented on HBASE-2489:
---

@chenning, this isn't a good place to ask for support; please send an email to 
the user mailing list.

At this point you should be using a version of HBase that's much newer than the 
one in which this fix was made, so you shouldn't need to apply this fix.

 Make the Filesystem needs to be upgraded error message more useful.
 -

 Key: HBASE-2489
 URL: https://issues.apache.org/jira/browse/HBASE-2489
 Project: HBase
  Issue Type: Improvement
  Components: util
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Trivial
 Fix For: 0.90.0

 Attachments: 
 0001-Improve-the-error-message-File-system-needs-to-be-up.patch


 The other day, when starting HBase I got this error:
 {noformat}
 2010-04-23 09:38:14,847 ERROR org.apache.hadoop.hbase.master.HMaster: Failed 
 to start master
 org.apache.hadoop.hbase.util.FileSystemVersionException: File system needs to 
 be upgraded. Run the '${HBASE_HOME}/bin/hbase migrate' script.
 at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:187)
 {noformat}
 I was puzzled until I realized, after adding extra debug statements in the 
 code, that I forgot to properly set {{hbase.rootdir}} after re-deploying my 
 dev environment.  I think the message above was misleading and I'm proposing 
 a trivial patch to make it a little bit better:
 {noformat}
 2010-04-23 09:48:29,000 ERROR org.apache.hadoop.hbase.master.HMaster: Failed 
 to start master
 org.apache.hadoop.hbase.util.FileSystemVersionException: File system needs to 
 be upgraded.  You have version null and I want version 7.  Run the 
 '${HBASE_HOME}/bin/hbase migrate' script.
 at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:189)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3581) hbase rpc should send size of response

2011-09-16 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13105894#comment-13105894
 ] 

Benoit Sigoure commented on HBASE-3581:
---

+1, thank you Stack.

 hbase rpc should send size of response
 --

 Key: HBASE-3581
 URL: https://issues.apache.org/jira/browse/HBASE-3581
 Project: HBase
  Issue Type: Improvement
Reporter: ryan rawson
Assignee: stack
Priority: Critical
 Fix For: 0.92.0

 Attachments: 3581-v2.txt, HBASE-rpc-response.txt


 The RPC reply from Server->Client does not include the size of the payload; 
 it is framed like so:
 i32 callId
 byte errorFlag
 byte[] data
 The data segment would contain enough info about how big the response is so 
 that it could be decoded by a writable reader.
 This makes it difficult to write buffering clients, who might read the entire 
 'data' then pass it to a decoder. While less memory efficient, if you want to 
 easily write block read clients (eg: nio) it would be necessary to send the 
 size along so that the client could snarf into a local buf.
 The new proposal is:
 i32 callId
 i32 size
 byte errorFlag
 byte[] data
 the size being sizeof(data) + sizeof(errorFlag).
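A sketch of why the extra length field helps a buffering client, assuming the proposed framing above and DataInput's big-endian ints (the class and method names are made up):
{code}
import java.io.DataInputStream;
import java.io.IOException;

public class FramedResponseReader {
  /** Holds one fully-buffered RPC response. */
  public static final class Response {
    public final int callId;
    public final boolean error;
    public final byte[] data;
    Response(int callId, boolean error, byte[] data) {
      this.callId = callId; this.error = error; this.data = data;
    }
  }

  /** With a size prefix the client can snarf the whole payload into a local
   *  buffer before handing it to a decoder -- no Writable parsing needed here. */
  public static Response read(DataInputStream in) throws IOException {
    int callId = in.readInt();
    int size = in.readInt();            // sizeof(errorFlag) + sizeof(data)
    boolean error = in.readBoolean();   // the errorFlag byte
    byte[] data = new byte[size - 1];   // remainder of the frame
    in.readFully(data);
    return new Response(callId, error, data);
  }
}
{code}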

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-4236) Don't lock the stream while serializing the response

2011-08-20 Thread Benoit Sigoure (JIRA)
Don't lock the stream while serializing the response


 Key: HBASE-4236
 URL: https://issues.apache.org/jira/browse/HBASE-4236
 Project: HBase
  Issue Type: Improvement
  Components: ipc
Affects Versions: 0.90.4
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor


It is not necessary to hold the lock on the stream while the response is being 
serialized.  This unnecessarily prevents serializing responses in parallel.
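A minimal sketch of the suggested change (all names hypothetical): serialize into a private buffer first, and take the shared stream's lock only for the write itself:
{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ResponderSketch {
  private final OutputStream out;          // shared socket stream
  private final Object outLock = new Object();

  ResponderSketch(OutputStream out) { this.out = out; }

  void sendResponse(int callId, byte[] payload) throws IOException {
    // Serialization happens outside the lock, so multiple handler threads
    // can build their responses in parallel.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(buf);
    dos.writeInt(callId);
    dos.writeInt(payload.length);
    dos.write(payload);
    dos.flush();
    byte[] frame = buf.toByteArray();
    // Only the write to the shared stream needs to be serialized.
    synchronized (outLock) {
      out.write(frame);
      out.flush();
    }
  }
}
{code}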

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-4237) Directly remove the call being handled from the map of outstanding RPCs

2011-08-20 Thread Benoit Sigoure (JIRA)
Directly remove the call being handled from the map of outstanding RPCs
---

 Key: HBASE-4237
 URL: https://issues.apache.org/jira/browse/HBASE-4237
 Project: HBase
  Issue Type: Improvement
  Components: ipc
Affects Versions: 0.90.4
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor


The client has to maintain a map of RPC ID to `Call' object for this RPC, for 
every outstanding RPC.  When receiving a response, the client was getting the 
`Call' out of the map (one O(log n) operation) and then removing it from the 
map (another O(log n) operation).  There is no benefit in not removing it 
directly from the map.
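A tiny sketch of the change, assuming the outstanding-RPC map is keyed by RPC ID (names hypothetical): {{remove()}} already returns the removed value, so the preceding {{get()}} is redundant:
{code}
import java.util.concurrent.ConcurrentSkipListMap;

public class OutstandingRpcs<C> {
  private final ConcurrentSkipListMap<Integer, C> calls =
      new ConcurrentSkipListMap<Integer, C>();

  void register(int rpcId, C call) {
    calls.put(rpcId, call);
  }

  /** Before: a get() followed by a remove() -- two O(log n) lookups. */
  C completeOld(int rpcId) {
    C call = calls.get(rpcId);
    calls.remove(rpcId);
    return call;
  }

  /** After: a single remove() both fetches and deletes the entry. */
  C complete(int rpcId) {
    return calls.remove(rpcId);
  }
}
{code}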


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4237) Directly remove the call being handled from the map of outstanding RPCs

2011-08-20 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088299#comment-13088299
 ] 

Benoit Sigoure commented on HBASE-4237:
---

Patch @ 
https://github.com/tsuna/hbase/commit/1f602391ee4cd3d11eaf3067208caeadf214b3a8

 Directly remove the call being handled from the map of outstanding RPCs
 ---

 Key: HBASE-4237
 URL: https://issues.apache.org/jira/browse/HBASE-4237
 Project: HBase
  Issue Type: Improvement
  Components: ipc
Affects Versions: 0.90.4
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor

 The client has to maintain a map of RPC ID to `Call' object for this RPC, for 
 every outstanding RPC.  When receiving a response, the client was getting the 
 `Call' out of the map (one O(log n) operation) and then removing it from the 
 map (another O(log n) operation).  There is no benefit in not removing it 
 directly from the map.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-2321) Support RPC interface changes at runtime

2011-08-17 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-2321:
--

Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed])

This breaks RPC compatibility.

 Support RPC interface changes at runtime
 

 Key: HBASE-2321
 URL: https://issues.apache.org/jira/browse/HBASE-2321
 Project: HBase
  Issue Type: Improvement
  Components: coprocessors
Reporter: Andrew Purtell
Assignee: Gary Helmling
 Fix For: 0.92.0


 Now we are able to append methods to interfaces without breaking RPC 
 compatibility with earlier releases. However there is no way that I am aware 
 of to dynamically add entire new RPC interfaces. Methods/parameters are fixed 
 to the class used to instantiate the server at that time. Coprocessors need 
 this. They will extend functionality on regions in arbitrary ways. How to 
 support that on the client side? A couple of options:
 1. New RPC from scratch.
 2. Modify HBaseServer such that multiple interface objects can be used for 
 reflection and objects can be added or removed at runtime. 
 3. Have the coprocessor host instantiate new HBaseServer instances on 
 ephemeral ports and publish the endpoints to clients via Zookeeper. Couple 
 this with a small modification to HBaseServer to support elastic thread pools 
 to minimize the number of threads that might be kept around in the JVM. 
 4. Add a generic method to HRegionInterface, an ioctl-like construction, 
 which accepts a ImmutableBytesWritable key and an array of Writable as 
 parameters. 
 My opinion is we should opt for #4 as it is the simplest and most expedient 
 approach. I could also do #3 if consensus prefers. Really we should do #1 but 
 it's not clear who has the time for that at the moment. 
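For option 4, roughly the shape being proposed; this is purely a sketch, with an invented interface and method name rather than what was eventually committed:
{code}
import java.io.IOException;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.Writable;

/** Hypothetical ioctl-style extension point on the region interface. */
public interface GenericRegionCall {
  /**
   * Dispatch an arbitrary, coprocessor-defined operation: the key names the
   * endpoint/command and the args carry its Writable-encoded parameters.
   */
  Writable exec(ImmutableBytesWritable command, Writable[] args) throws IOException;
}
{code}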

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-3973) HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY

2011-06-09 Thread Benoit Sigoure (JIRA)
HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY
--

 Key: HBASE-3973
 URL: https://issues.apache.org/jira/browse/HBASE-3973
 Project: HBase
  Issue Type: Improvement
  Components: shell
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor


In the HBase shell, when the output isn't a TTY, the shell assumes the 
terminal to be 100 characters wide.  The way the shell wraps things around 
makes it very hard to script the output of the shell (e.g. redirect the output 
to a file and then work on that file, or pipe the output to another command).

When stdout isn't a TTY, the shell shouldn't try to wrap things around.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3973) HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY

2011-06-09 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-3973:
--

Status: Patch Available  (was: Open)

 HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY
 --

 Key: HBASE-3973
 URL: https://issues.apache.org/jira/browse/HBASE-3973
 Project: HBase
  Issue Type: Improvement
  Components: shell
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
 Attachments: hbase-hirb-formatter.patch


 In the HBase shell, when the output isn't a TTY, the shell assumes the 
 terminal to be 100 characters wide.  The way the shell wraps things around 
 makes it very hard to script the output of the shell (e.g. redirect the 
 output to a file and then work on that file, or pipe the output to another 
 command).
 When stdout isn't a TTY, the shell shouldn't try to wrap things around.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3973) HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY

2011-06-09 Thread Benoit Sigoure (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Sigoure updated HBASE-3973:
--

Attachment: hbase-hirb-formatter.patch

Patch to fix the issue.

 HBase IRB shell: Don't pretty-print the output when stdout isn't a TTY
 --

 Key: HBASE-3973
 URL: https://issues.apache.org/jira/browse/HBASE-3973
 Project: HBase
  Issue Type: Improvement
  Components: shell
Reporter: Benoit Sigoure
Assignee: Benoit Sigoure
Priority: Minor
 Attachments: hbase-hirb-formatter.patch


 In the HBase shell, when the output isn't a TTY, the shell assumes the 
 terminal to be 100 characters wide.  The way the shell wraps things around 
 makes it very hard to script the output of the shell (e.g. redirect the 
 output to a file and then work on that file, or pipe the output to another 
 command).
 When stdout isn't a TTY, the shell shouldn't try to wrap things around.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-3859) Increment a counter when a Scanner lease expires

2011-05-05 Thread Benoit Sigoure (JIRA)
Increment a counter when a Scanner lease expires


 Key: HBASE-3859
 URL: https://issues.apache.org/jira/browse/HBASE-3859
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.90.2
Reporter: Benoit Sigoure
Priority: Minor


Whenever a Scanner lease expires, the RegionServer will close it automatically 
and log a message to complain.  I would like the RegionServer to increment a 
counter whenever this happens and expose this counter through the metrics 
system, so we can plug this into our monitoring system (OpenTSDB) and keep 
track of how frequently this happens.  It's not supposed to happen frequently 
so it's good to keep an eye on it.
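A minimal sketch of the counting side (not HBase's actual metrics classes; just an {{AtomicLong}} that a metrics reporter could export):
{code}
import java.util.concurrent.atomic.AtomicLong;

public class ScannerLeaseMetrics {
  // Monotonically increasing count of expired scanner leases; a metrics
  // reporter (and from there OpenTSDB) would read this value periodically.
  private final AtomicLong expiredScannerLeases = new AtomicLong();

  /** Called from the lease-expired handler, next to the existing log line. */
  public void onScannerLeaseExpired() {
    expiredScannerLeases.incrementAndGet();
    // Closing the scanner and logging stay exactly as they are today.
  }

  public long getExpiredScannerLeases() {
    return expiredScannerLeases.get();
  }
}
{code}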

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-3850) Log more details when a scanner lease expires

2011-05-03 Thread Benoit Sigoure (JIRA)
Log more details when a scanner lease expires
-

 Key: HBASE-3850
 URL: https://issues.apache.org/jira/browse/HBASE-3850
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Benoit Sigoure
Priority: Minor


The message logged by the RegionServer when a Scanner lease expires isn't as 
useful as it could be.  {{Scanner 4765412385779771089 lease expired}} - most 
clients don't log their scanner ID, so it's really hard to figure out what was 
going on.  I think it would be useful to at least log the name of the region on 
which the Scanner was open, and it would be great to have the ip:port of the 
client that had that lease too.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3732) New configuration option for client-side compression

2011-04-19 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13021675#comment-13021675
 ] 

Benoit Sigoure commented on HBASE-3732:
---

Sounds good Stack.

 New configuration option for client-side compression
 

 Key: HBASE-3732
 URL: https://issues.apache.org/jira/browse/HBASE-3732
 Project: HBase
  Issue Type: New Feature
Reporter: Jean-Daniel Cryans
 Fix For: 0.92.0


 We have a case here where we have to store very fat cells (arrays of 
 integers) which can amount into the hundreds of KBs that we need to read 
 often, concurrently, and possibly keep in cache. Compressing the values on 
 the client using java.util.zip's Deflater before sending them to HBase proved 
 to be in our case almost an order of magnitude faster.
 The reasons are evident: less data sent to HBase, memstore contains 
 compressed data, block cache contains compressed data too, etc.
 I was thinking that it might be something useful to add to a family schema, 
 so that Put/Result do the conversion for you. The actual compression algo 
 should also be configurable.
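A sketch of the client-side trick with plain {{java.util.zip}} and the old {{Put}} API (column names invented; in the proposal this would be driven by the family schema rather than done by hand):
{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

import org.apache.hadoop.hbase.client.Put;

public class ClientSideCompression {
  /** Deflate a fat value before it ever leaves the client. */
  static byte[] compress(byte[] value) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DeflaterOutputStream out =
        new DeflaterOutputStream(buf, new Deflater(Deflater.BEST_SPEED));
    out.write(value);
    out.finish();
    return buf.toByteArray();
  }

  /** The memstore, WAL and block cache all see only the compressed bytes. */
  static Put makePut(byte[] row, byte[] family, byte[] qualifier, byte[] fatValue)
      throws IOException {
    Put put = new Put(row);
    put.add(family, qualifier, compress(fatValue));
    return put;
  }
}
{code}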

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3732) New configuration option for client-side compression

2011-04-05 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016021#comment-13016021
 ] 

Benoit Sigoure commented on HBASE-3732:
---

Oh yeah I forgot that this was in the {{info:regioninfo}} column, my bad.

Wouldn't it be awesome if this was actually on a key-per-key basis?  Is there a 
spare bit in {{KeyValue}} we can steal to indicate this KV is compressed?  We 
could not only compress the value, but also the column qualifier and/or the key 
if they're big too (some applications store data in the column qualifier or, 
less frequently, in the key).

 New configuration option for client-side compression
 

 Key: HBASE-3732
 URL: https://issues.apache.org/jira/browse/HBASE-3732
 Project: HBase
  Issue Type: New Feature
Reporter: Jean-Daniel Cryans
 Fix For: 0.92.0


 We have a case here where we have to store very fat cells (arrays of 
 integers) which can amount into the hundreds of KBs that we need to read 
 often, concurrently, and possibly keep in cache. Compressing the values on 
 the client using java.util.zip's Deflater before sending them to HBase proved 
 to be in our case almost an order of magnitude faster.
 The reasons are evident: less data sent to HBase, memstore contains 
 compressed data, block cache contains compressed data too, etc.
 I was thinking that it might be something useful to add to a family schema, 
 so that Put/Result do the conversion for you. The actual compression algo 
 should also be configurable.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3732) New configuration option for client-side compression

2011-04-04 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13015753#comment-13015753
 ] 

Benoit Sigoure commented on HBASE-3732:
---

If you want {{Put}}/{{Result}} to do the conversion for you, that means the 
client needs to be aware of the schema of the table before it can start using 
it, right?  Because right now HBase clients don't know the schema, so it's 
something extra that they'd need to look up separately, unless we add new fields 
in the {{.META.}} table that go along with each and every region.

 New configuration option for client-side compression
 

 Key: HBASE-3732
 URL: https://issues.apache.org/jira/browse/HBASE-3732
 Project: HBase
  Issue Type: New Feature
Reporter: Jean-Daniel Cryans
 Fix For: 0.92.0


 We have a case here where we have to store very fat cells (arrays of 
 integers) which can amount into the hundreds of KBs that we need to read 
 often, concurrently, and possibly keep in cache. Compressing the values on 
 the client using java.util.zip's Deflater before sending them to HBase proved 
 to be in our case almost an order of magnitude faster.
 The reasons are evident: less data sent to HBase, memstore contains 
 compressed data, block cache contains compressed data too, etc.
 I was thinking that it might be something useful to add to a family schema, 
 so that Put/Result do the conversion for you. The actual compression algo 
 should also be configurable.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-2170) hbase lightweight client library as a distribution

2011-03-19 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008872#comment-13008872
 ] 

Benoit Sigoure commented on HBASE-2170:
---

bq. This is an impressive number. Just curious if u were able to run the same 
benchmark with WAL turned on, and what numbers you see then..

Curiously enough, I see the same numbers.

This is the 1st import I did Thursday (no WAL)
{code}
$ ./src/tsdb import /tmp/data.gz
[...]
2011-03-17 18:45:51,797 INFO  [main] TextImporter: ... 1000000 data points in 6688ms (149521.5 points/s)
2011-03-17 18:45:56,836 INFO  [main] TextImporter: ... 2000000 data points in 5044ms (198255.4 points/s)
2011-03-17 18:46:01,823 INFO  [main] TextImporter: ... 3000000 data points in 4986ms (200561.6 points/s)
2011-03-17 18:46:06,848 INFO  [main] TextImporter: ... 4000000 data points in 5025ms (199005.0 points/s)
2011-03-17 18:46:11,865 INFO  [main] TextImporter: ... 5000000 data points in 5016ms (199362.0 points/s)
2011-03-17 18:46:14,315 INFO  [main] TextImporter: Processed /tmp/data.gz in 29211 ms, 5487065 data points (187842.4 points/s)
2011-03-17 18:46:14,315 INFO  [main] TextImporter: Total: imported 5487065 data points in 29.212s (187838.4 points/s)
{code}
Note: 1 data point = 1 {{KeyValue}}.

I commented out {{dp.setBatchImport(true);}} in 
[TextImporter.getDataPoints|https://github.com/stumbleupon/opentsdb/blob/master/src/tools/TextImporter.java#L225]
 and ran the same import again.  Note: this isn't exactly an apples-to-apples 
comparison because I'm going to overwrite existing {{KeyValue}}s instead of 
creating new ones.  The table has {{VERSIONS=1}} but I think we disabled major 
compactions so we don't delete old data (Stack/JD correct me if I'm mistaken 
about our setup).
{code}
$ ./src/tsdb import /tmp/data.gz
[...]
2011-03-19 19:09:36,102 INFO  [main] TextImporter: ... 1000000 data points in 6699ms (149276.0 points/s)
2011-03-19 19:09:41,101 INFO  [main] TextImporter: ... 2000000 data points in 5004ms (199840.1 points/s)
2011-03-19 19:09:46,051 INFO  [main] TextImporter: ... 3000000 data points in 4949ms (202061.0 points/s)
2011-03-19 19:09:51,006 INFO  [main] TextImporter: ... 4000000 data points in 4955ms (201816.3 points/s)
2011-03-19 19:09:56,017 INFO  [main] TextImporter: ... 5000000 data points in 5010ms (199600.8 points/s)
2011-03-19 19:09:58,422 INFO  [main] TextImporter: Processed /tmp/data.gz in 29025 ms, 5487065 data points (189046.2 points/s)
2011-03-19 19:09:58,422 INFO  [main] TextImporter: Total: imported 5487065 data points in 29.026s (189041.3 points/s)
{code}

So... this totally surprises me.  I expected to see a big performance drop with 
the WAL enabled.  I wondered if I didn't properly recompile the code or if 
something else was still disabling the WAL, but I verified with {{strace}} that 
the WAL was turned on in the RPC that was going out:
{code}
$ strace -f -e trace=write -s 4096 ./src/tsdb import /tmp/data.gz
[...]
[pid 21364] write(32, 
\0\0\312\313\0\0\0\3\0\10multiPut\0\0\0\00199\0\0\0\1Btsdb,\0\3\371L\301[\360\0\0\7\0\2;,1300586854474.a2a283a471dfcf5dcda82d05f2d468ed.\0\0\0:\1\r\0\3\371MZ2\200\0\0\7\0\0\216\177\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\1\0\0\0\1\1t\0\0\0(\0\0\6\340\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\0{\177\377\377\377\377\377\377\377\4\0\0\0\0C\0\0\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\1k\177\377\377\377\377\377\377\377\4\0\0\0\0Cd...
{code}
This shows that the WAL is enabled.  Having the source of [asynchbase's 
{{MultiPutRequest}}|https://github.com/stumbleupon/asynchbase/blob/master/src/MultiPutRequest.java#L274]
 greatly helps make sense of this otherwise impossible to understand blob:
* We can easily see where the region name is, it contains an MD5 sum followed 
by a period ({{.}}).
* After the region name, the next 4 bytes are the number of edits for this 
region: {{\0\0\0:}} = 58
* Then there's a byte with value 1 for the version of the {{Put}} object: 
{{\1}}
* Then there's the row key of the row we're writing to: 
{{\r\0\3\371MZ2\200\0\0\7\0\0\216}} where:
** {{\r}} is a {{vint}} indicating that the key length is 13 bytes
** The first 3 bytes of the row key in OpenTSDB correspond to the metric ID: 
{{\0\3\371}}
** The next 4 bytes in OpenTSDB correspond to a UNIX timestamp: {{MZ2\200}}.  
Using Python, it's easy to confirm that:
{code}
>>> import struct
>>> import time
>>> struct.unpack(">I", "MZ2\200")
(1297756800,)
>>> time.ctime(*_)
'Tue Feb 15 00:00:00 2011'
{code}
** The next 6 bytes in OpenTSDB correspond to a tag:
*** 3 bytes for a tag name ID: {{\0\0\7}}
*** 3 bytes for a tag value ID: {{\0\0\216}}
* Then we have the timestamp of the edit, which is unset, so it's 
{{Long.MAX_VALUE}} which is {{\177\377\377\377\377\377\377\377}}
* Then we have the {{RowLock}} ID.  In this case no row lock is involved, so 
the value is {{-1L}}: 

[jira] Commented: (HBASE-3671) Split report before we finish parent region open; workaround till 0.92; Race between split and OPENED processing

2011-03-18 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008574#comment-13008574
 ] 

Benoit Sigoure commented on HBASE-3671:
---

+1 too, thanks for the quick turnaround guys!

 Split report before we finish parent region open; workaround till 0.92; Race 
 between split and OPENED processing
 

 Key: HBASE-3671
 URL: https://issues.apache.org/jira/browse/HBASE-3671
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: stack
 Attachments: 3671.txt


 This issue is about adding a workaround to 0.90 until we get a proper fix in 
 0.92 (HBASE-3559).
 Here is the sequence of events:
 1. We start to process the OPENED region event.
 2. We receive a SPLIT report for this region.
 3. SPLIT processing offlines the region and onlines the daughters.
 4. The Metascanner runs and clears the region out of .META., deleting it.
 5. The OPENED handler runs.  It marks the region online in Master memory.
 6. The Balancer runs.  It tries to balance a region that has been deleted and 
 loops forever.
 Here is an excerpt from the logs.  It happened during startup, with lots going 
 on.  It could happen on a regionserver crash too, I suppose, but we're 
 susceptible during cluster start:
 {code}
 # We assign the region
 2011-03-16 15:18:29,053 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x22e286f0b9c98f1 Async create of unassigned node for 
 3516b74d0c9d4458c2f2f715249e3f78 with OFFLINE state
 ...
 2011-03-16 15:18:32,298 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
  
 rs=tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.
  state=OFFLINE, ts=1300313909053, server=sv4borg39,60020,1300313564807
 ...
 2011-03-16 15:18:32,732 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
  
 rs=tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.
  state=OFFLINE, ts=1300313909053
 ...
 2011-03-16 15:23:02,114 DEBUG 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
 master:6-0x22e286f0b9c98f1 Received ZooKeeper Event, 
 type=NodeDataChanged, state=SyncConnected, 
 path=/prodjobs/unassigned/3516b74d0c9d4458c2f2f715249e3f78
 ...
 2011-03-16 15:23:02,183 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:6-0x22e286f0b9c98f1 Retrieved 127 byte(s) of data from znode 
 /prodjobs/unassigned/3516b74d0c9d4458c2f2f715249e3f78 and set watcher; 
 region=tsdb,^@^D2McZ@^@^@^A^@^@G^@^@^L^@^@f^@^@^U^@^@�^@^@(^@^C^G,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.,
  server=sv4borg39,60020,1300313564807, state=RS_ZK_REGION_OPENED
 2011-03-16 15:23:02,183 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=sv4borg39,60020,1300313564807, 
 region=3516b74d0c9d4458c2f2f715249e3f78
 # At this point we've queued an Excecutor to run to process the OPENED event. 
  Now in comes the SPLIT.
 2011-03-16 15:23:18,199 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.:
  Daughters; 
 tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1300314189812.74c51400bb8dfa127fadfd11a04d72f2.,
  
 tsdb,\x00\x042MmD\x88\x00\x00\x01\x00\x00S\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x029\x00\x00(\x00\x03\x03,1300314189812.87b061739a11d0f9d02acfb92ef961a2.
  from sv4borg39,60020,1300313564807
 2011-03-16 15:23:18,870 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Split report has RIT node 
 (shouldnt have one): REGION = {NAME = 
 'tsdb,\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.',
  STARTKEY = 
 '\x00\x042McZ@\x00\x00\x01\x00\x00G\x00\x00\x0C\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x03\x07',
  ENDKEY = 
 '\x00\x043L\xE7\xF50\x00\x00\x01\x00\x00I\x00\x00\x0C\x00\x00f\x00\x00\x0E\x00\x00f\x00\x00\x15\x00\x00\xA9\x00\x00(\x00\x02u',
  ENCODED = 3516b74d0c9d4458c2f2f715249e3f78, TABLE = {{NAME = 'tsdb', 
 FAMILIES = [{NAME = 't', BLOOMFILTER = 'NONE', REPLICATION_SCOPE = '0', 
 VERSIONS = '3', COMPRESSION = 'LZO', TTL = '2147483647', BLOCKSIZE = 
 '65536', IN_MEMORY = 'false', BLOCKCACHE = 'true'}]}} node: 
 region=tsdb,^@^D2McZ@^@^@^A^@^@G^@^@^L^@^@f^@^@^U^@^@�^@^@(^@^C^G,1299401073466.3516b74d0c9d4458c2f2f715249e3f78.,
  server=sv4borg39,60020,1300313564807, state=RS_ZK_REGION_OPENED
 # Now 
