Hey Jacob,

Sorry you hit this issue. Yes, HBASE-20723 introduced this bug, and it is fixed by HBASE-20734.
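To make the failure mode concrete: the "Wrong FS" exception in your stack trace comes from handing an s3:// path (under the EMRFS root directory) to a FileSystem instance that was created for the hdfs:// WAL directory. The snippet below is only a minimal, hypothetical illustration of that mismatch, not the actual WALSplitter code; the hostnames, bucket, and paths are placeholders, and running it needs the matching filesystem implementations (EMRFS for s3:// on EMR) on the classpath.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Illustrates the "Wrong FS" mismatch: a FileSystem created for the
     * hdfs:// WAL directory cannot operate on s3:// paths under the HBase
     * root directory, and vice versa.
     */
    public class WrongFsIllustration {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // FileSystem bound to the WAL directory's scheme (HDFS on EMR).
        Path walDir = new Path("hdfs://namenode.example.internal:8020/user/hbase/WALs");
        FileSystem walFs = walDir.getFileSystem(conf);

        // A recovered.edits file that lives under the HBase root dir (s3:// via EMRFS).
        Path recoveredEdit = new Path(
            "s3://example-bucket/hbase/data/default/t1/abcd1234/recovered.edits/0000000000000000001");

        // Handing the s3:// path to the hdfs-bound FileSystem trips FileSystem.checkPath(),
        // which throws IllegalArgumentException: "Wrong FS: s3://..., expected: hdfs://..."
        // -- the same failure as in the WALSplitter stack trace below.
        // walFs.open(recoveredEdit);  // uncomment against a real cluster to see the error

        // Resolving the FileSystem from the path itself picks the implementation that
        // matches the path's scheme, so each path is read with the right filesystem.
        FileSystem rootFs = recoveredEdit.getFileSystem(conf);  // EMRFS/S3 on EMR
        System.out.println("WAL filesystem:  " + walFs.getUri());
        System.out.println("root filesystem: " + rootFs.getUri());
      }
    }

HBASE-20734 sidesteps this by keeping the recovered edits on the same filesystem as the WAL, so only one filesystem is involved during splitting.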
As for EMR: emr-5.18.0 contains HBASE-20734, so this bug is not present there. Please try upgrading to emr-5.18.0 to avoid the error.
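Once you're on emr-5.18.0, a quick client-side sanity check of the version the cluster is actually running (just a sketch: it assumes a standard hbase-site.xml on the classpath, and the class name is arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CheckClusterVersion {
      public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          // ClusterStatus reports the version the active master is running,
          // which is what matters for the WAL-splitting fixes.
          System.out.println("HBase version: " + admin.getClusterStatus().getHBaseVersion());
        }
      }
    }

Alternatively, running "hbase version" on the master node prints the installed version.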
Thanks,
Zach

On Mon, Nov 26, 2018 at 12:39 PM LeBlanc, Jacob <[email protected]> wrote:

> Hi,
>
> We've recently upgraded our production clusters to 1.4.6. We have jobs
> that periodically run to take snapshots of some of our HBase tables, and
> these jobs seem to be running into
> https://issues.apache.org/jira/browse/HBASE-21069. I understand there was
> a missing null check, but in the bug I don't really see any explanation
> of how the null occurs in the first place. For those of us running 1.4.6,
> is there anything we can do to avoid hitting the bug?
>
> This problem is made worse because we are running a cluster in AWS EMR,
> meaning our WAL is on a different filesystem (HDFS) than the HBase root
> directory (EMRFS), and we are hitting some sort of issue where sometimes
> the master gets stuck while splitting a WAL from the crashed region
> server:
>
> 2018-11-20 12:01:58,599 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't rename
> s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359708-ip-172-20-113-197.us-west-2.compute.internal%2C16020%2C1542620776146.1542673338055.temp
> to
> s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720
> java.io.IOException: Cannot get log reader
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:365)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.deleteOneWithFewerEntries(WALSplitter.java:1363)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1496)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1448)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1445)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Wrong FS: s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720, expected: hdfs://ip-172-20-113-83.us-west-2.compute.internal:8020
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:669)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:337)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:790)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
>   ... 12 more
>
> It seems like https://issues.apache.org/jira/browse/HBASE-20723 did not
> cover all use cases. My understanding is that in 1.4.8 the recovered
> edits are collocated with the WAL, so this will no longer be an issue
> (https://issues.apache.org/jira/browse/HBASE-20734), but AWS has yet to
> release an EMR version with 1.4.8, so this is causing us pain right now
> when we hit this situation (it doesn't seem to happen every time a region
> server crashes - only twice so far).
>
> Unfortunately, because we are running an AWS EMR cluster, we can't really
> just patch the region servers ourselves. We have the option of upgrading
> to 1.4.7 to get the fix for HBASE-21069, but that will take us a little
> time to test, release, and schedule downtime for our application, so any
> mitigating steps we could take in the meantime would be appreciated.
>
> Thanks,
>
> --Jacob LeBlanc
