[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512915#comment-16512915 ] Josh Elser edited comment on HBASE-20723 at 6/14/18 7:53 PM:

[~yuzhih...@gmail.com], have you tried writing a simple minicluster test to reproduce the issue? My understanding is that we should be able to trigger this easily in a contrived local case. My thinking is that a high-level test would make the issue crystal clear for folks:
* Create a one-RS minicluster
* Configure store files on mini HDFS
* Configure WALs on the local filesystem
* Write some edits to a table (small, to avoid a possible flush)
* Restart the RS
* Observe edits missing from the table

It seems like this went unnoticed due to differences in recovery semantics between fshlog and multiwal? Is this a guess, or are we sure of it?

> WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
> -
>
> Key: HBASE-20723
> URL: https://issues.apache.org/jira/browse/HBASE-20723
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Affects Versions: 1.1.2
> Reporter: Rohan Pednekar
> Assignee: Ted Yu
> Priority: Major
> Attachments: 20723.v1.txt, 20723.v2.txt, 20723.v3.txt, 20723.v4.txt, 20723.v5.txt, 20723.v5.txt, 20723.v6.txt, logs.zip
>
> This is an Azure HDInsight HBase cluster with HDP 2.6 and HBase 1.1.2.2.6.3.2-14.
> By default the underlying data goes to wasb://x@y/hbase.
> I tried to move the WAL folders to HDFS, which is the SSD mounted on each VM at /mnt:
> hbase.wal.dir=hdfs://mycluster/walontest
> hbase.wal.dir.perms=700
> hbase.rootdir.perms=700
> hbase.rootdir=wasb://XYZ[@hbaseperf.core.net|mailto:duohbase5ds...@duohbaseperf.blob.core.windows.net]/hbase
> Procedure to reproduce this issue:
> 1. create a table in the hbase shell
> 2. insert a row in the hbase shell
> 3. reboot the VM which hosts that region
> 4. scan the table in the hbase shell and it is empty
> Looking at the region server logs:
> {code:java}
> 2018-06-12 22:08:40,455 INFO [RS_LOG_REPLAY_OPS-wn2-duohba:16020-0-Writer-1] wal.WALSplitter: This region's directory doesn't exist: hdfs://mycluster/walontest/data/default/tb1/b7fd7db5694eb71190955292b3ff7648. It is very likely that it was already split so it's safe to discard those edits.
> {code}
> The log split/replay ignored the actual WAL because WALSplitter looks for the region directory under the hbase.wal.dir we specified rather than under hbase.rootdir.
> Looking at the source code, https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitter.java uses the rootDir, which is walDir, as the tableDir root path.
> So if we use HBASE-17437 and the waldir and hbase rootdir are different paths, or even on different filesystems, then #5's use of walDir as tableDir is apparently wrong.
> CC: [~zyork], [~yuzhih...@gmail.com] Attached the logs for quick review.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
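The faulty path derivation in the report above is easy to see with plain paths. This is an illustrative sketch, not HBase's actual WALSplitter code: `regionDir` is a hypothetical helper mirroring how the splitter resolves a region directory from whichever root it is handed, using the directories and encoded region name from the log line above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class WalDirBugSketch {
    // Hypothetical mirror of the faulty lookup: per HBASE-20723, the
    // splitter derives the region directory from the WAL root instead
    // of hbase.rootdir. Names here are illustrative, not HBase API.
    static Path regionDir(Path root, String table, String encodedRegion) {
        return root.resolve("data").resolve("default")
                   .resolve(table).resolve(encodedRegion);
    }

    public static void main(String[] args) {
        Path walDir  = Paths.get("/walontest"); // hbase.wal.dir
        Path rootDir = Paths.get("/hbase");     // hbase.rootdir
        String region = "b7fd7db5694eb71190955292b3ff7648";

        // Buggy lookup: uses the WAL dir as the table root, so the
        // region directory "doesn't exist" and the edits are discarded.
        System.out.println(regionDir(walDir, "tb1", region));
        // Correct lookup: the region actually lives under hbase.rootdir.
        System.out.println(regionDir(rootDir, "tb1", region));
    }
}
```

With the two roots split per HBASE-17437, the buggy lookup lands on the WAL filesystem, which is exactly the nonexistent directory named in the RegionServer log.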
[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511808#comment-16511808 ] Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 11:53 PM:

[~yuzhih...@gmail.com] I see, but then my test case may be unrelated to this issue, because I'm using `hbase.wal.provider=multiwal` and `hbase.wal.regiongrouping.strategy=identity`; it should not call WALSplitter (let me know if I'm wrong).
[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511756#comment-16511756 ] Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 11:20 PM:

I didn't set `hbase.wal.dir`, but it is the EMR default location on HDFS, where it is set to `hdfs://:/user/hbase/WAL` (updated).
[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511756#comment-16511756 ] Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 11:01 PM:

I didn't set `hbase.wal.dir`; it is the default location on HDFS (I can confirm it when I test again and update here).
[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511733#comment-16511733 ] Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 10:24 PM:

[~busbey] I went back to check; here are more details about the test.
1. It was a cluster of 1 master (HMaster only) and 6 slaves (HDFS + RegionServer) running HBase 1.4.2 on EMR, with `hbase.rootdir` set to an S3 bucket and identity multiwal.
2. I didn't use ITBLL to verify it; I have my own program that writes and verifies. Please find [SequentialBatchWrite|https://drive.google.com/open?id=1IIYP8ypkjslrwDJekr61AJcNRWQjjBJK] and [README|https://drive.google.com/open?id=1g7DsGa1cutWLziPCIN1ko71fazHX-QVtlHavxPNYFU0] here (sorry, my testing code may be dirty). It's not hard to reproduce.
3. `dfs.replication` was 2 (maybe this is why I saw data loss?) but I never turned off any datanode.
4. The client program didn't fail and continued writing to the end, although when I killed the assigned region server, the writes hung for a while before moving on to the next puts.
5. About logs on HDFS or HBase: I can retest next week and capture any ERROR or stacktrace, but may I ask what you expect to see in the logs?

[~yuzhih...@gmail.com] the logs have been garbage collected, so I will need to rerun it and see if those lines exist.
[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511644#comment-16511644 ] Sean Busbey edited comment on HBASE-20723 at 6/13/18 8:49 PM:

{quote}
or the hflush in DFSOutputStream used by WAL's ProtobufLogWriter, as far as I understand, is writing blocks/packets to HDFS but not a complete WAL file, where those sent blocks/packets are a group of writes that have not been combined into a single file before the WAL is closed(). (let me know if I'm wrong)
{quote}
It's a group of writes that hflush promises us is in memory at all block-replica DataNodes. It's true that the DataNode might not have persisted to disk yet, but unless all the nodes in the pipeline die, the stream is supposed to be recoverable up to the point of the flush. This is one of the foundational blocks of HBase being a consistent datastore.
{quote}
So, I found this problem when testing HBase on S3 with a 3-node cluster and the WAL on HDFS. I wrote an hbase-client program to sequentially write N (100k) records (whose keys and values are both the numbers #1 to #N), terminated the assigned region server with `kill -9 $pid`, and restarted it. The writing region(s) were reassigned to another region server within a few seconds; the client program completed without errors, but when verifying the records, a few were missing.
{quote}
This sounds like a dataloss bug. Is it easily reproducible? Does it show up using e.g. ITBLL with the region-server-killing chaos monkey?

Only 3 nodes in the cluster means that if we have block replication set to 3 then we can't avoid having a local block. It's not ideal, but it shouldn't cause dataloss if we aren't losing the other two. Can you confirm block replication is set to >= 3 in HDFS? Is the client making sure it got a success on the write before moving on to the next entry? Can we get more details on specific versions?
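The durability contract Sean describes (no "OK" to the client until the flush has returned) can be sketched without an HDFS cluster. Below, a local `FileChannel.force()` is a stand-in for HDFS `hflush()`/`hsync()`; the real WAL writes through an `FSDataOutputStream`, so this is an analogy for the ordering, not HBase's actual writer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WalAppendSketch {
    private final FileChannel ch;

    WalAppendSketch(Path p) throws IOException {
        ch = FileChannel.open(p, StandardOpenOption.CREATE,
                                 StandardOpenOption.WRITE,
                                 StandardOpenOption.APPEND);
    }

    // Returns only after the edit has been flushed out of the process;
    // mirrors HBase acking a write only once hflush() has completed.
    void appendDurable(String edit) throws IOException {
        ch.write(ByteBuffer.wrap((edit + "\n").getBytes(StandardCharsets.UTF_8)));
        ch.force(false); // stand-in for hflush(); no ack before this returns
    }

    public static void main(String[] args) throws IOException {
        Path wal = Files.createTempFile("wal", ".log");
        WalAppendSketch w = new WalAppendSketch(wal);
        w.appendDurable("put row1");
        w.appendDurable("put row2");
        // Every acked edit is readable back from the log.
        System.out.println(Files.readAllLines(wal));
    }
}
```

The point of the ordering is the one Sean makes: if the ack only follows the flush, then a `kill -9` of the writer can lose at most unacked edits, so acked-but-missing records point at recovery (here, WAL splitting), not at the append path.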
[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511621#comment-16511621 ] Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 8:21 PM:

For the [hflush in DFSOutputStream|https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L525] used by WAL's ProtobufLogWriter, as far as I understand, it is writing blocks/packets to HDFS but not a complete WAL file, where those sent blocks/packets are a group of writes that have not been combined into a single file before the WAL is closed(). (let me know if I'm wrong)

So, I found this problem when testing HBase on S3 with a 3-node cluster and the WAL on HDFS. I wrote an hbase-client program to sequentially write N (100k) records (whose keys and values are both the numbers #1 to #N), terminated the assigned region server with `kill -9 $pid`, and restarted it. The writing region(s) were reassigned to another region server within a few seconds; the client program completed without errors, but when verifying the records, a few were missing.
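The write-then-verify protocol of Tak Lon's test can be sketched without the hbase-client dependency. Here a `HashMap` stands in for the table and plain method calls stand in for the Put/Get RPCs (the real SequentialBatchWrite program runs against a live cluster); the point is the invariant the test checks, not the HBase API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequentialWriteVerifySketch {
    public static void main(String[] args) {
        final int n = 100_000;
        // Stand-in for the HBase table; the real test used Put/Get.
        Map<String, String> table = new HashMap<>();

        // Write phase: keys and values are both the numbers 1..N, and
        // the client only advances once the previous put has succeeded
        // (here, returned without throwing).
        for (int i = 1; i <= n; i++) {
            table.put(Integer.toString(i), Integer.toString(i));
        }

        // Verify phase: every acked write must be readable. Against a
        // real cluster, missing keys here after a RegionServer kill
        // indicate that WAL recovery discarded acked edits.
        List<String> missing = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            if (!Integer.toString(i).equals(table.get(Integer.toString(i)))) {
                missing.add(Integer.toString(i));
            }
        }
        System.out.println("missing=" + missing.size());
    }
}
```

In-process the verify pass trivially finds nothing missing; the reported bug is that the same check against a cluster with a split wal.dir/rootdir came back nonzero.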
[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
[ https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511552#comment-16511552 ] Sean Busbey edited comment on HBASE-20723 at 6/13/18 7:01 PM:

{quote}
The one thing that one of my colleagues figured out recently is that edits aren't actually persisted to the WAL until they either reach a certain size or a time limit has elapsed that triggers the hsync() or hflush(). Since the VM didn't exit correctly, I'm assuming this is what happened. Can you try loading more data in (still under the flush size/interval), but enough to cause an hsync to the WAL file, and see if you have the same issue?
{quote}
This isn't supposed to be the case, though. We're not supposed to return an "OK" to the client doing the write until we're done an hflush. That won't help if the underlying nodes for HDFS all fail. Is the WAL HDFS instance set up to have 3 replicas per block?

(Edit to replace hsync with hflush)