Hey Jacob,

Sorry you hit this issue. Yes, HBASE-20723 introduced this bug, and it is fixed by HBASE-20734.
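To make the failure mode concrete: the "Wrong FS" exception in your stack trace comes from handing an s3:// path (under the EMRFS root directory) to a FileSystem instance that was created for the hdfs:// WAL directory. The snippet below is only a minimal, hypothetical illustration of that mismatch, not the actual WALSplitter code; the hostnames, bucket, and paths are placeholders, and running it needs the matching filesystem implementations (EMRFS for s3:// on EMR) on the classpath.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Illustrates the "Wrong FS" mismatch: a FileSystem created for the
     * hdfs:// WAL directory cannot operate on s3:// paths under the HBase
     * root directory, and vice versa.
     */
    public class WrongFsIllustration {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // FileSystem bound to the WAL directory's scheme (HDFS on EMR).
        Path walDir = new Path("hdfs://namenode.example.internal:8020/user/hbase/WALs");
        FileSystem walFs = walDir.getFileSystem(conf);

        // A recovered.edits file that lives under the HBase root dir (s3:// via EMRFS).
        Path recoveredEdit = new Path(
            "s3://example-bucket/hbase/data/default/t1/abcd1234/recovered.edits/0000000000000000001");

        // Handing the s3:// path to the hdfs-bound FileSystem trips FileSystem.checkPath(),
        // which throws IllegalArgumentException: "Wrong FS: s3://..., expected: hdfs://..."
        // -- the same failure as in the WALSplitter stack trace below.
        // walFs.open(recoveredEdit);  // uncomment against a real cluster to see the error

        // Resolving the FileSystem from the path itself picks the implementation that
        // matches the path's scheme, so each path is read with the right filesystem.
        FileSystem rootFs = recoveredEdit.getFileSystem(conf);  // EMRFS/S3 on EMR
        System.out.println("WAL filesystem:  " + walFs.getUri());
        System.out.println("root filesystem: " + rootFs.getUri());
      }
    }

HBASE-20734 sidesteps this by keeping the recovered edits on the same filesystem as the WAL, so only one filesystem is involved during splitting.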
As for EMR: emr-5.18.0 contains HBASE-20734, so this bug is not present there. Please try upgrading to emr-5.18.0 to avoid the error.
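Once you're on emr-5.18.0, a quick client-side sanity check of the version the cluster is actually running (just a sketch: it assumes a standard hbase-site.xml on the classpath, and the class name is arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CheckClusterVersion {
      public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          // ClusterStatus reports the version the active master is running,
          // which is what matters for the WAL-splitting fixes.
          System.out.println("HBase version: " + admin.getClusterStatus().getHBaseVersion());
        }
      }
    }

Alternatively, running "hbase version" on the master node prints the installed version.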
Thanks,
Zach

On Mon, Nov 26, 2018 at 12:39 PM LeBlanc, Jacob <[email protected]> wrote:

> Hi,
>
> We've recently upgraded our production clusters to 1.4.6. We have jobs
> that periodically run to take snapshots of some of our HBase tables, and
> these jobs seem to be running into
> https://issues.apache.org/jira/browse/HBASE-21069. I understand there was
> a missing null check, but in the bug I don't really see any explanation
> of how the null occurs in the first place. For those of us running 1.4.6,
> is there anything we can do to avoid hitting the bug?
>
> This problem is made worse because we are running a cluster in AWS EMR,
> meaning our WAL is on a different filesystem (HDFS) than the HBase root
> directory (EMRFS), and we are hitting some sort of issue where sometimes
> the master gets stuck while splitting a WAL from the crashed region
> server:
>
> 2018-11-20 12:01:58,599 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't rename
> s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359708-ip-172-20-113-197.us-west-2.compute.internal%2C16020%2C1542620776146.1542673338055.temp
> to
> s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720
> java.io.IOException: Cannot get log reader
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:365)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.deleteOneWithFewerEntries(WALSplitter.java:1363)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1496)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1448)
>   at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1445)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Wrong FS: s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720, expected: hdfs://ip-172-20-113-83.us-west-2.compute.internal:8020
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:669)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:337)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:790)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
>   ... 12 more
>
> It seems like https://issues.apache.org/jira/browse/HBASE-20723 did not
> cover all use cases. My understanding is that in 1.4.8 the recovered
> edits are collocated with the WAL, so this will no longer be an issue
> (https://issues.apache.org/jira/browse/HBASE-20734), but AWS has yet to
> release an EMR version with 1.4.8, so this is causing us pain right now
> when we hit this situation (it doesn't seem to happen every time a region
> server crashes - only twice so far).
>
> Unfortunately, because we are running an AWS EMR cluster, we can't really
> just patch the region servers ourselves. We have the option of upgrading
> to 1.4.7 to get the fix for HBASE-21069, but that will take us a little
> time to test, release, and schedule downtime for our application, so any
> mitigating steps we could take in the meantime would be appreciated.
>
> Thanks,
>
> --Jacob LeBlanc
