[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-14 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512915#comment-16512915
 ] 

Josh Elser edited comment on HBASE-20723 at 6/14/18 7:53 PM:
-

[~yuzhih...@gmail.com], have you tried writing a simple minicluster test to 
reproduce the issue? My understanding is that we should be able to trigger this 
easily in a contrived local case. My thinking is that a high-level test would 
make the issue crystal clear for folks; a rough sketch follows the list.
 * Create one RS minicluster
 * Configure store files on mini HDFS
 * Configure WALs on local filesystem
 * Write some edits to a table (small, to avoid possible flush)
 * Restart RS
 * Observe edits missing from our table
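
A rough sketch of what that test could look like (untested; assumes the 
HBaseTestingUtility / MiniHBaseCluster APIs, so method names may differ by 
branch):
{code:java}
import static org.junit.Assert.assertFalse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.MiniHBaseCluster;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

public class TestWALSplitWithSeparateWalDir {
  @Test
  public void editsSurviveRsRestart() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // WALs on the local filesystem; store files stay on the mini HDFS.
    conf.set("hbase.wal.dir", "file:///tmp/wal-on-localfs");
    HBaseTestingUtility util = new HBaseTestingUtility(conf);
    util.startMiniCluster(1); // one-RS minicluster
    TableName tn = TableName.valueOf("t1");
    Table table = util.createTable(tn, "f");
    // One small edit, well under the flush size, so it only lives in the WAL.
    table.put(new Put(Bytes.toBytes("row"))
        .addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v")));
    // Kill and restart the lone RS so recovery must split/replay the WAL.
    MiniHBaseCluster cluster = util.getMiniHBaseCluster();
    cluster.killRegionServer(cluster.getRegionServer(0).getServerName());
    cluster.startRegionServer();
    util.waitTableAvailable(tn);
    Result r = table.get(new Get(Bytes.toBytes("row")));
    assertFalse("edit lost after WAL split", r.isEmpty());
    util.shutdownMiniCluster();
  }
}
{code}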

Seems like this went unnoticed due to differences in recovery semantics from 
fshlog to multiwal? Is this a guess or are we sure of this?


was (Author: elserj):
[~yuzhih...@gmail.com], have you tried writing a simple minicluster test to 
reproduce the issue? My understanding is that we should be able to easily 
trigger this in a contrived local case.
 * Create one RS minicluster
 * Configure store files on mini HDFS
 * Configure WALs on local filesystem
 * Write some edits to a table (small, to avoid possible flush)
 * Restart RS
 * Observe edits missing from our table

Seems like this went unnoticed due to differences in recovery semantics from 
fshlog to multiwal? Is this a guess or are we sure of this?

> WALSplitter uses the rootDir, which is walDir, as the tableDir root path.
> -
>
> Key: HBASE-20723
> URL: https://issues.apache.org/jira/browse/HBASE-20723
> Project: HBase
>  Issue Type: Bug
>  Components: hbase
>Affects Versions: 1.1.2
>Reporter: Rohan Pednekar
>Assignee: Ted Yu
>Priority: Major
> Attachments: 20723.v1.txt, 20723.v2.txt, 20723.v3.txt, 20723.v4.txt, 
> 20723.v5.txt, 20723.v5.txt, 20723.v6.txt, logs.zip
>
>
> This is an Azure HDInsight HBase cluster with HDP 2.6 and HBase 
> 1.1.2.2.6.3.2-14.
> By default the underlying data goes to wasb://x@y/hbase.
>  I tried to move the WAL folders to HDFS, which is on the SSD mounted on each 
> VM at /mnt.
> hbase.wal.dir=hdfs://mycluster/walontest
> hbase.wal.dir.perms=700
> hbase.rootdir.perms=700
> hbase.rootdir=wasb://XYZ@hbaseperf.core.net/hbase
> Procedure to reproduce this issue:
> 1. create a table in hbase shell
> 2. insert a row in hbase shell
> 3. reboot the VM which hosts that region
> 4. scan the table in hbase shell and it is empty
> Looking at the region server logs:
> {code:java}
> 2018-06-12 22:08:40,455 INFO  [RS_LOG_REPLAY_OPS-wn2-duohba:16020-0-Writer-1] 
> wal.WALSplitter: This region's directory doesn't exist: 
> hdfs://mycluster/walontest/data/default/tb1/b7fd7db5694eb71190955292b3ff7648. 
> It is very likely that it was already split so it's safe to discard those 
> edits.
> {code}
> The log split/replay ignored the actual WAL edits because WALSplitter looks 
> for the region directory under the hbase.wal.dir we specified rather than 
> under the hbase.rootdir.
> Looking at the source code,
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitter.java
>  it uses the rootDir, which is walDir, as the tableDir root path.
> So if we use HBASE-17437 and the walDir and the hbase rootdir are on 
> different paths, or even on different filesystems, then #5's use of walDir as 
> the tableDir is clearly wrong.
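>
> An illustrative sketch of the mix-up (hypothetical variable names; not the 
> actual WALSplitter code):
> {code:java}
> // WALSplitter is handed the walDir as its rootDir, so the tableDir it
> // derives lands on the WAL filesystem instead of under hbase.rootdir:
> Path rootDir = walDir; // hdfs://mycluster/walontest (the wrong base path)
> Path tableDir = FSUtils.getTableDir(rootDir, tableName);
> // -> hdfs://mycluster/walontest/data/default/tb1/..., which doesn't exist,
> // so the split discards the edits instead of replaying them against
> // wasb://.../hbase/data/default/tb1/...
> {code}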
> CC: [~zyork], [~yuzhih...@gmail.com] Attached the logs for quick review.





[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-13 Thread Tak Lon (Stephen) Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511808#comment-16511808
 ] 

Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 11:53 PM:


[~yuzhih...@gmail.com] I see, but then my test case may be unrelated to this 
issue, because I'm using `hbase.wal.provider=multiwal` and 
`hbase.wal.regiongrouping.strategy=identity`; it should not call WALSplitter 
(let me know if I'm wrong)
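
For reference, a hedged sketch of those two settings as they would be set 
programmatically (equivalent to the hbase-site.xml entries):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
// "multiwal" selects the RegionGroupingProvider; "identity" gives one WAL
// group per region.
conf.set("hbase.wal.provider", "multiwal");
conf.set("hbase.wal.regiongrouping.strategy", "identity");
{code}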


was (Author: taklwu):
[~yuzhih...@gmail.com] I see, but then my test case may be unrelated to this 
issue, because I'm using `multiwal` and `identity`; it should not call 
WALSplitter (let me know if I'm wrong)



[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-13 Thread Tak Lon (Stephen) Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511756#comment-16511756
 ] 

Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 11:20 PM:


I didn't set `hbase.wal.dir`, but it is the EMR default location on HDFS, where 
it is set to `hdfs://:/user/hbase/WAL` (updated)


was (Author: taklwu):
I didn't set `hbase.wal.dir`; it is the default location on HDFS (I can confirm 
it when I test again and update here)



[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-13 Thread Tak Lon (Stephen) Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511756#comment-16511756
 ] 

Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 11:01 PM:


I didn't set `hbase.wal.dir`; it is the default location on HDFS (I can confirm 
it when I test again and update here)


was (Author: taklwu):
I didn't set `hbase.wal.dir`; it is the default location on HDFS.



[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-13 Thread Tak Lon (Stephen) Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511733#comment-16511733
 ] 

Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 10:24 PM:


[~busbey] I went back to check; here are more details about the test.
 1. It was a cluster of 1 master (only HMaster) and 6 slaves (HDFS + 
RegionServer) running HBase 1.4.2 on EMR, with `hbase.rootdir` set to an S3 
bucket and identity multiwal.

2. I didn't use ITBLL to verify it; I have my own program that writes and 
verifies. Please find 
[SequentialBatchWrite|https://drive.google.com/open?id=1IIYP8ypkjslrwDJekr61AJcNRWQjjBJK]
 and 
[README|https://drive.google.com/open?id=1g7DsGa1cutWLziPCIN1ko71fazHX-QVtlHavxPNYFU0]
 here (sorry, my testing code may be dirty). It's not hard to reproduce.

3. `dfs.replication` was 2 (maybe this is why I saw data loss?) but I never 
turned off any datanode.

4. The executed client program didn't fail and continued writing till the end, 
although when I killed the assigned region server, the writes hung for a while 
before moving on to the next puts.

5. About logs on HDFS or HBase: I can retest next week and capture any ERRORs 
or stacktraces, but may I ask what you expect to see in the logs?

[~yuzhih...@gmail.com] The logs have been garbage collected, so I will need to 
rerun it and see if those lines exist.


was (Author: taklwu):
[~busbey] I went back to check; here are more details about the test.
1. It was a cluster of 1 master (only HMaster) and 6 slaves (HDFS + 
RegionServer) running HBase 1.4.2 on EMR, with `hbase.rootdir` set to an S3 
bucket and identity multiwal.

2. I didn't use ITBLL to verify it; I have my own program that writes and 
verifies. Please find 
[SequentialBatchWrite|https://drive.google.com/open?id=1IIYP8ypkjslrwDJekr61AJcNRWQjjBJK]
 and 
[README|https://drive.google.com/open?id=1g7DsGa1cutWLziPCIN1ko71fazHX-QVtlHavxPNYFU0]
 here (sorry, my testing code may be dirty). It's not hard to reproduce.

3. `dfs.replication` was 2 (maybe this is why I saw data loss?) but I never 
turned off any datanode.

4. The executed client program didn't fail and continued writing till the end, 
although when I killed the assigned region server, the writes hung for a while 
before moving on to the next puts.

5. About logs on HDFS or HBase: I can retest next week and capture any ERRORs 
or stacktraces, but may I ask what you expect to see in the logs?

[~yuzhih...@gmail.com] The logs have been garbage collected, so I will need to 
rerun it and see if those lines exist.



[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-13 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511644#comment-16511644
 ] 

Sean Busbey edited comment on HBASE-20723 at 6/13/18 8:49 PM:
--

{quote}
or the hflush in DFSOutputStream used by the WAL's ProtobufLogWriter, as far as 
I understand, is writing blocks/packets to HDFS but not a complete WAL file, 
where those sent blocks/packets are a group of writes that have not been 
combined into a single file before the WAL is closed(). (let me know if I'm 
wrong)
{quote}

It's a group of writes that hflush promises us is in memory at all of the 
block's replica DataNodes. It's true that the DataNodes might not have 
persisted it to disk yet, but unless all the nodes in the pipeline die, the 
stream is supposed to be recoverable up to the point of the flush. This is one 
of the foundational building blocks of HBase being a consistent datastore.
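
To illustrate that contract at the HDFS API level (a standalone sketch with a 
made-up path and payload; in HBase the call sits under ProtobufLogWriter's 
output stream):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// hflush() returns only once every DataNode in the pipeline holds the bytes
// in memory; disk persistence may lag, but the stream should be recoverable
// up to this point unless the whole pipeline is lost.
FileSystem fs = FileSystem.get(new Configuration());
byte[] edit = new byte[] {1, 2, 3};
try (FSDataOutputStream out = fs.create(new Path("/wal/example"))) {
  out.write(edit);
  out.hflush(); // durability point for WAL recovery
}
{code}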

{quote}
So, I found this problem when testing HBase on S3 with a 3-node cluster and the 
WAL on HDFS. I wrote an hbase-client to sequentially write N (100k) records 
(whose keys and values are both the numbers #1 to #N), terminated the assigned 
region server with `kill -9 $pid`, and restarted it. The writing region(s) were 
reassigned to another region server within a few seconds; the client program 
completed w/o errors, but when verifying the records, a few records were missing.
{quote}

This sounds like a dataloss bug. Is it easily reproducible? Does it show up 
using e.g. ITBLL with the region server killing chaos monkey?

Only 3 nodes in the cluster means that if we have block replication set to 3 
then we can't avoid having a local block. It's not ideal, but it shouldn't 
cause dataloss if we aren't losing the other two. Can you confirm block 
replication is set to >= 3 in HDFS?

Is the client making sure it got a success on the write before moving on to the 
next entry?

Can we get more details on specific versions?
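
On the success-check question above, a minimal client-side sketch (illustrative 
names; assumes an open Connection `conn`):
{code:java}
// Table.put() is supposed to return only after the server has synced the
// WAL entry, and it throws (after retries) on failure, so a normal return
// is the client's signal of success before moving on to the next entry.
try (Table table = conn.getTable(TableName.valueOf("t1"))) {
  table.put(new Put(Bytes.toBytes("k1"))
      .addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v1")));
  // safe to advance to the next record here
}
{code}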


was (Author: busbey):
{quote}
or the hflush in DFSOutputStream used by the WAL's ProtobufLogWriter, as far as 
I understand, is writing blocks/packets to HDFS but not a complete WAL file, 
where those sent blocks/packets are a group of writes that have not been 
combined into a single file before the WAL is closed(). (let me know if I'm 
wrong)
{quote}

It's a group of writes that hflush promises us is in memory at all of the 
block's replica DataNodes. It's true that the DataNodes might not have 
persisted it to disk yet, but unless all the nodes in the pipeline die, the 
stream is supposed to be recoverable up to the point of the flush. This is one 
of the foundational building blocks of HBase being a consistent datastore.

{quote}
So, I found this problem when testing HBase on S3 with a 3-node cluster and the 
WAL on HDFS. I wrote an hbase-client to sequentially write N (100k) records 
(whose keys and values are both the numbers #1 to #N), terminated the assigned 
region server with `kill -9 $pid`, and restarted it. The writing region(s) were 
reassigned to another region server within a few seconds; the client program 
completed w/o errors, but when verifying the records, a few records were missing.
{quote}

This sounds like a dataloss bug. Is it easily reproducible? Does it show up 
using e.g. ITBLL with the region server killing chaos monkey?

Only 3 nodes in the cluster means that if we have block replication set to 3 
then we can't avoid having a local block. It's not ideal, but it shouldn't 
cause dataloss if we aren't losing the other two. Can you confirm replication 
is set to >= 3?

Is the client making sure it got a success on the write before moving on to the 
next entry?

Can we get more details on specific versions?


[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-13 Thread Tak Lon (Stephen) Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511621#comment-16511621
 ] 

Tak Lon (Stephen) Wu edited comment on HBASE-20723 at 6/13/18 8:21 PM:
---

For the [hflush in 
DFSOutputStream|https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L525]
 used by the WAL's ProtobufLogWriter, as far as I understand it is writing 
blocks/packets to HDFS but not a complete WAL file, where those sent 
blocks/packets are a group of writes that have not been combined into a single 
file before the WAL is closed(). (let me know if I'm wrong)

So, I found this problem when testing HBase on S3 with a 3-node cluster and the 
WAL on HDFS. I wrote an hbase-client to sequentially write N (100k) records 
(whose keys and values are both the numbers #1 to #N), terminated the assigned 
region server with `kill -9 $pid`, and restarted it. The writing region(s) were 
reassigned to another region server within a few seconds; the client program 
completed w/o errors, but when verifying the records, a few records were missing.
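
A hedged sketch of that write-then-verify flow (the real SequentialBatchWrite 
program is linked above; the table name and column layout here are illustrative):
{code:java}
Configuration conf = HBaseConfiguration.create();
byte[] cf = Bytes.toBytes("f"), q = Bytes.toBytes("q");
long n = 100_000L;
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("t1"))) {
  for (long i = 1; i <= n; i++) {
    byte[] kv = Bytes.toBytes(i);                // key == value == i
    table.put(new Put(kv).addColumn(cf, q, kv));
  }
  // `kill -9` the RS hosting the region mid-run; regions get reassigned.
  for (long i = 1; i <= n; i++) {                // verify pass
    if (table.get(new Get(Bytes.toBytes(i))).isEmpty()) {
      System.out.println("missing record " + i); // the observed data loss
    }
  }
}
{code}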


was (Author: taklwu):
For the [hflush in 
DFSOutputStream|https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L525]
 used by the WAL's ProtobufLogWriter, as far as I understand it is writing 
blocks/packets to HDFS but not a complete WAL file, where those sent 
blocks/packets are a group of writes that have not been combined into a single 
file before the WAL is closed(). (let me know if I'm wrong)

So, I found this problem when testing HBase on S3 with a 3-node cluster and the 
WAL on HDFS. I wrote an hbase-client to sequentially write N records (whose 
keys and values are both the numbers #1 to #N), terminated the assigned region 
server with `kill -9 $pid`, and restarted it. The writing region(s) were 
reassigned to another region server within a few seconds; the client program 
completed w/o errors, but when verifying the records, a few records were missing.



[jira] [Comment Edited] (HBASE-20723) WALSplitter uses the rootDir, which is walDir, as the tableDir root path.

2018-06-13 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511552#comment-16511552
 ] 

Sean Busbey edited comment on HBASE-20723 at 6/13/18 7:01 PM:
--

{quote}
The one thing that one of my colleagues figured out recently is that edits 
aren't actually persisted to the WAL until they either reach a certain size or 
a time limit has elapsed that triggers the hsync() or hflush(). Since the VM 
didn't exit correctly, I'm assuming this is what happened. Can you try loading 
more data in (still under the flush size/interval), but enough to cause a hsync 
to the WAL file and see if you have the same issue?
{quote}

This isn't supposed to be the case, though? We're not supposed to return an 
"OK" to the client doing the write until we've done an hflush. That won't help 
if the underlying nodes for HDFS all fail. Is the WAL HDFS instance set up to 
have 3 replicas per block?

(Edit to replace hsync with hflush)


was (Author: busbey):
{quote}
The one thing that one of my colleagues figured out recently is that edits 
aren't actually persisted to the WAL until they either reach a certain size or 
a time limit has elapsed that triggers the hsync() or hflush(). Since the VM 
didn't exit correctly, I'm assuming this is what happened. Can you try loading 
more data in (still under the flush size/interval), but enough to cause a hsync 
to the WAL file and see if you have the same issue?
{quote}

This isn't supposed to be the case, though? We're not supposed to return an 
"OK" to the client doing the write until we've done an hsync. That won't help 
if the underlying nodes for HDFS all fail. Is the WAL HDFS instance set up to 
have 3 replicas per block?
