[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117302#comment-14117302 ]

Hudson commented on HBASE-11591:

FAILURE: Integrated in HBase-1.0 #142 (See [https://builds.apache.org/job/HBase-1.0/142/])
HBASE-11591 Scanner fails to retrieve KV from bulk loaded file with (ramkrishna: rev 844f3dfb6a9b2267b7e06ee2a176c76ae89ff7bf)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestScannerWithBulkload.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---

Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0, 2.0.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0, 2.0.0
Attachments: HBASE-11591.patch, HBASE-11591_1.patch, HBASE-11591_2.patch, HBASE-11591_3.patch, HBASE-11591_5.patch, HBASE-11591_6.patch, HBASE-11591_6.patch, TestBulkload.java, hbase-11591-03-02.patch, hbase-11591-03-jeff.patch

See discussion in HBASE-11339.
Consider the case where the same KVs exist in two files, one produced by flush/compaction and the other through bulk load. Both files have some KVs that match even in timestamp.
Steps:
Add some rows with a specific timestamp and flush them.
Bulk load a file with the same data. Ensure that the assign seqnum property is set.
The bulk load should use HFileOutputFormat2 (or ensure that we write the bulk_time_output key).
This ensures that the bulk loaded file has the highest seq num.
Assume the cell in the flushed/compacted store file is
row1,cf,cq,ts1,value1
and the cell in the bulk loaded file is
row1,cf,cq,ts1,value2
(There are no parallel scans.)
Issue a scan on the table in 0.96. The retrieved value is
row1,cf,cq,ts1,value2
But the same scan in 0.98 will retrieve row1,cf,cq,ts1,value1.
This is a behaviour change. It is caused by this code:
{code}
public int compare(KeyValueScanner left, KeyValueScanner right) {
  int comparison = compare(left.peek(), right.peek());
  if (comparison != 0) {
    return comparison;
  } else {
    // Since both the keys are exactly the same, we break the tie in favor
    // of the key which came latest.
    long leftSequenceID = left.getSequenceID();
    long rightSequenceID = right.getSequenceID();
    if (leftSequenceID > rightSequenceID) {
      return -1;
    } else if (leftSequenceID < rightSequenceID) {
      return 1;
    } else {
      return 0;
    }
  }
}
{code}
In the 0.96 case the mvcc of the cell in both files is 0, so the comparison falls through to the else branch, where the seq id of the bulk loaded file is greater and sorts first, ensuring that the scan reads from the bulk loaded file.
In 0.98+, since we retain the mvcc+seqid, we do not reset the mvcc to 0 (it remains a non-zero positive value). Hence compare() sorts the cell in the flushed/compacted file first, which means that although we know the latest file is the bulk loaded file, we don't scan its data.
This seems to be a behaviour change. We will check other corner cases as well, but we are trying to understand the behaviour of bulk load because we are evaluating whether it can be used for the MOB design.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
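The ordering problem in the issue description can be illustrated with a small, self-contained sketch. It is not HBase code: `TieBreakSketch`, `Cell`, `Scanner`, `compareCells`, and `winner` are hypothetical stand-ins, the sequence ids 5 and 10 and the mvcc value 7 are made up, and the sketch assumes that the cell comparison falls back to mvcc (newest first) once row and timestamp match, which is what the description implies for 0.98+.

```java
import java.util.Comparator;

// Hypothetical, simplified stand-ins for HBase's cell and scanner types.
final class TieBreakSketch {

    /** A cell reduced to the fields that matter for this ordering. */
    static final class Cell {
        final String row;
        final long timestamp;
        final long mvcc;
        final String value;

        Cell(String row, long timestamp, long mvcc, String value) {
            this.row = row;
            this.timestamp = timestamp;
            this.mvcc = mvcc;
            this.value = value;
        }
    }

    /** A scanner positioned on one cell, tagged with its file's sequence id. */
    static final class Scanner {
        final Cell top;
        final long sequenceId;

        Scanner(Cell top, long sequenceId) {
            this.top = top;
            this.sequenceId = sequenceId;
        }
    }

    /** Assumed cell order: row ascending, timestamp descending, mvcc descending. */
    static int compareCells(Cell a, Cell b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = Long.compare(b.timestamp, a.timestamp);
        if (c != 0) return c;
        return Long.compare(b.mvcc, a.mvcc);
    }

    /** Scanner order: cells first, then the higher sequence id wins a tie. */
    static final Comparator<Scanner> SCANNER_ORDER = (left, right) -> {
        int c = compareCells(left.top, right.top);
        return c != 0 ? c : Long.compare(right.sequenceId, left.sequenceId);
    };

    /** Returns the value a scan would see first for the given flushed-cell mvcc. */
    static String winner(long flushedMvcc) {
        Scanner flushed = new Scanner(new Cell("row1", 1L, flushedMvcc, "value1"), 5L);
        Scanner bulkLoaded = new Scanner(new Cell("row1", 1L, 0L, "value2"), 10L);
        return SCANNER_ORDER.compare(flushed, bulkLoaded) <= 0
                ? flushed.top.value
                : bulkLoaded.top.value;
    }

    public static void main(String[] args) {
        System.out.println("mvcc reset to 0 (0.96-like): " + winner(0L));
        System.out.println("mvcc retained (0.98-like):   " + winner(7L));
    }
}
```

With the flushed cell's mvcc at 0 the cells tie and the sequence-id branch picks the bulk loaded file; with a retained non-zero mvcc the cell comparison already decides in favour of the flushed file, so the sequence-id tie-break is never reached.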
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117341#comment-14117341 ]

Anoop Sam John commented on HBASE-11591:

+1 for addendum for branch-1.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116217#comment-14116217 ]

Anoop Sam John commented on HBASE-11591:

[~ram_krish] The affected version and fix version are given as 0.99, but this was pushed only to master?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110344#comment-14110344 ]

Hadoop QA commented on HBASE-11591:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12664316/HBASE-11591_6.patch
against trunk revision .
ATTACHMENT ID: 12664316

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 10 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 8 warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:red}-1 site{color}. The patch appears to cause mvn site goal to fail.
{color:red}-1 core tests{color}. The patch failed these unit tests:

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10578//console

This message is automatically generated.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110403#comment-14110403 ]

ramkrishna.s.vasudevan commented on HBASE-11591:

The javadoc warning is not from this patch.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110573#comment-14110573 ]

Hadoop QA commented on HBASE-11591:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12664353/HBASE-11591_6.patch
against trunk revision .
ATTACHMENT ID: 12664353

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 10 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 8 warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
{color:red}-1 core zombie tests{color}. There are 2 zombie test(s):
at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:250)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10585//console

This message is automatically generated.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110744#comment-14110744 ]

Hudson commented on HBASE-11591:

FAILURE: Integrated in HBase-TRUNK #5431 (See [https://builds.apache.org/job/HBase-TRUNK/5431/])
HBASE-11591 Scanner fails to retrieve KV from bulk loaded file with (ramkrishna: rev dea6480023e78a3facdaf1cfc00ad6cc35ecb3ea)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestScanWithBloomError.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestScannerWithBulkload.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109021#comment-14109021 ]

ramkrishna.s.vasudevan commented on HBASE-11591:

[~jeffreyz] Is the latest patch good for commit?
[~anoop.hbase], [~saint@gmail.com] What do you guys think?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110032#comment-14110032 ]

Jeffrey Zhong commented on HBASE-11591:
---

The patch looks good to me (+1) with one minor comment:
{noformat}
+ w = mvcc.beginMemstoreInsert();
+ long flushSeqId = getNextSequenceId(wal);
+ FlushResult flushResult = new FlushResult(
+     FlushResult.Result.CANNOT_FLUSH_MEMSTORE_EMPTY, flushSeqId, "Nothing to flush");
+ w.setWriteNumber(flushSeqId);
+ mvcc.waitForPreviousTransactionsComplete(w);
+ return flushResult;
{noformat}
You can set w=null after mvcc.waitForPreviousTransactionsComplete(w); so that mvcc.advanceMemstore in the finally block can be skipped.
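Jeffrey Zhong's suggestion to set w=null relies on the finally block only acting on a non-null write entry. A minimal sketch of that pattern follows, assuming the finally block is guarded by a null check. `SkipFinallySketch`, `WriteEntry`, `Mvcc`, and `emptyFlush` are hypothetical stand-ins that only record which calls were made; they are not the HRegion or MVCC implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins that record calls instead of doing real MVCC work.
final class SkipFinallySketch {

    static final class WriteEntry {}

    static final class Mvcc {
        final List<String> calls = new ArrayList<>();

        WriteEntry beginMemstoreInsert() {
            calls.add("begin");
            return new WriteEntry();
        }

        void waitForPreviousTransactionsComplete(WriteEntry w) {
            calls.add("wait");
        }

        void advanceMemstore(WriteEntry w) {
            calls.add("advance");
        }
    }

    /** Runs the empty-flush path and returns the recorded MVCC calls. */
    static List<String> emptyFlush(boolean clearEntry) {
        Mvcc mvcc = new Mvcc();
        WriteEntry w = null;
        try {
            w = mvcc.beginMemstoreInsert();
            mvcc.waitForPreviousTransactionsComplete(w);
            if (clearEntry) {
                w = null; // the entry is already complete, so skip the cleanup below
            }
            return mvcc.calls;
        } finally {
            // Assumed guard: cleanup runs only when an entry is still pending.
            if (w != null) {
                mvcc.advanceMemstore(w);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("without w = null: " + emptyFlush(false));
        System.out.println("with w = null:    " + emptyFlush(true));
    }
}
```

Because the finally block runs even on the early return, clearing the reference is the only thing that keeps the second call from being made in this sketch.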
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110275#comment-14110275 ]
ramkrishna.s.vasudevan commented on HBASE-11591:
bq. You can set w=null after mvcc.waitForPreviousTransactionsComplete(w); so mvcc.advanceMemstore in finally block can be skipped.
Thanks for the review. I will update and commit the patch.
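The review suggestion quoted in this message can be illustrated with a small try/finally sketch. It assumes, without the source confirming it, that the finally block guards the advanceMemstore call with a null check on w. Mvcc and WriteEntry here are hypothetical stand-ins that only record which calls were made; they are not the HBase classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "set w = null" pattern; the types are stand-ins, not HBase classes. */
final class EmptyFlushSketch {

    /** Hypothetical write entry holding the sequence number assigned to it. */
    static final class WriteEntry {
        long writeNumber;
    }

    /** Hypothetical mvcc that records the calls made on it. */
    static final class Mvcc {
        final List<String> calls = new ArrayList<String>();

        WriteEntry beginMemstoreInsert() {
            calls.add("begin");
            return new WriteEntry();
        }

        void waitForPreviousTransactionsComplete(WriteEntry w) {
            calls.add("waitForPrevious");
        }

        void advanceMemstore(WriteEntry w) {
            calls.add("advance");
        }
    }

    /** Handles the "nothing to flush" case and returns the sequence id used for it. */
    static long flushEmptyMemstore(Mvcc mvcc, long nextSeqId) {
        WriteEntry w = null;
        try {
            w = mvcc.beginMemstoreInsert();
            w.writeNumber = nextSeqId;
            mvcc.waitForPreviousTransactionsComplete(w);
            w = null; // the entry is already complete, so the finally block has nothing to do
            return nextSeqId;
        } finally {
            if (w != null) { // assumed guard: only reached with a live entry if something threw
                mvcc.advanceMemstore(w);
            }
        }
    }

    public static void main(String[] args) {
        Mvcc mvcc = new Mvcc();
        System.out.println(flushEmptyMemstore(mvcc, 42L) + " " + mvcc.calls);
    }
}
```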
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105185#comment-14105185 ]
ramkrishna.s.vasudevan commented on HBASE-11591:
[~jeffreyz] First of all, thanks a lot for taking a look at this issue and providing a patch. I debugged this issue in two cases, with my patch and with Jeffrey's patch, and observed the following:
- The testcase that was added as part of this issue covers identical KVs in the store file and the bulk loaded store file, and it is specific to that issue.
- After Jeffrey's first patch a few testcases failed, and those did not have this case of identical KVs; all had different row keys. Things were working fine (even before the patch) because the ordering was purely based on row key comparison and mvcc never came into it. I think the patch exposed a bug that was already there.
- Another important observation is that when we scan the KVs in the bulk loaded file (at least those created in the new LoadIncrementalHFile cases), no mvcc info is added to the metadata either. So
{code}
return new StoreFileScanner(this, getScanner(cacheBlocks, pread, isCompaction), !isCompaction, reader.hasMVCCInfo(), readPt);
{code}
will report hasMVCCInfo as false, and hence skipKVsNewerThanReadpoint() would never be called because of
{code}
if (hasMVCCInfo) skipKVsNewerThanReadpoint();
{code}
So even before the patch, in the failed test case TestWALReplay.testCompactedBulkLoadedFiles(), although the seqId of the bulk loaded files was 5 and the read point for all the scanners created in the test case was 4, we were still reading the bulk loaded file. We were not able to skip the KVs in the bulk loaded file simply because hasMVCCInfo was false, so the tests were passing. What happens after Jeffrey's patch (the first patch, without the HRegion change) is that on seeing any bulk loaded file we just assign the file's seqId to the KV's seqId.
And so after compaction the read point is still not advanced to the latest value (i.e. 5), and hence all the KVs that were written to the compacted file from the bulk loaded files were missing. I think the change in HRegion.java to set the write sequence number is a bug fix?
I still feel the patch would cause an issue in the following scenario after the above changes:
- Assume a scan has started and the read point is 20 at that time.
- A bulk load is just completing and the scanner heap gets reset. The new bulk loaded file with seqId 22 (for example) is now added to the scanner heap. But remember that the read point is still 20.
- After this change we would just set the bulk load file's seqId, which is 22, on all its KVs.
- Because there is no mvcc info in this bulk loaded file, the scan would not be able to skip the KVs newer than the read point, and hence it would still see the KVs with seqId 22 although the intention is to see only KVs with seqId up to 20.
I may be wrong. Am I missing something here? Perhaps, because bulk loaded files carry no mvcc, we are allowed to read anything in them irrespective of the read point?
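The read point scenario from this comment can be walked through with a small model. It is a sketch under stated assumptions: Cell and visibleRows are hypothetical names, and the only rule modelled is the one quoted above, namely that cells newer than the read point are skipped only when the file reports mvcc info.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of read point filtering for a single file; not the HBase scanner code. */
final class ReadPointSketch {

    /** Hypothetical cell: a row name plus the seqId assigned to it. */
    static final class Cell {
        final String row;
        final long seqId;

        Cell(String row, long seqId) {
            this.row = row;
            this.seqId = seqId;
        }
    }

    /** Rows that a scan at the given read point would return from one file. */
    static List<String> visibleRows(List<Cell> cells, boolean hasMvccInfo, long readPoint) {
        List<String> rows = new ArrayList<String>();
        for (Cell cell : cells) {
            // Newer cells are skipped only when the file claims to carry mvcc info.
            if (hasMvccInfo && cell.seqId > readPoint) {
                continue;
            }
            rows.add(cell.row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Bulk loaded file whose cells were all stamped with the file seqId 22.
        List<Cell> bulkFile = Arrays.asList(new Cell("row1", 22L), new Cell("row2", 22L));
        System.out.println("hasMVCCInfo=false: " + visibleRows(bulkFile, false, 20L));
        System.out.println("hasMVCCInfo=true:  " + visibleRows(bulkFile, true, 20L));
    }
}
```

With hasMVCCInfo false the scan at read point 20 sees both cells stamped 22; with it true, the same scan sees none of them.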
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105286#comment-14105286 ]
ramkrishna.s.vasudevan commented on HBASE-11591:
Another question: when a flush is done as part of a bulk load and there is nothing to be flushed, should we still update the mvcc?
{code}
if (this.memstoreSize.get() <= 0) {
  // Presume that if there are still no edits in the memstore, then there are no edits for
  // this region out in the WAL/HLog subsystem so no need to do any trickery clearing out
  // edits in the WAL system. Up the sequence number so the resulting flush id is for
  // sure just beyond the last appended region edit (useful as a marker when bulk loading,
  // etc.)
  // wal can be null replaying edits.
  return wal != null ?
    new FlushResult(FlushResult.Result.CANNOT_FLUSH_MEMSTORE_EMPTY, getNextSequenceId(wal), "Nothing to flush") :
    new FlushResult(FlushResult.Result.CANNOT_FLUSH_MEMSTORE_EMPTY, "Nothing to flush");
}
{code}
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105428#comment-14105428 ]
Ted Yu commented on HBASE-11591:
Minor:
{code}
-return !hasMVCCInfo ? true : skipKVsNewerThanReadpoint();
+if (!hasMVCCInfo && this.reader.isBulkLoaded()) {
+  return skipKVsNewerThanReadpoint();
+} else {
+  return !hasMVCCInfo ? true : skipKVsNewerThanReadpoint();
{code}
The if condition above would be more readable if written this way:
{code}
+if (hasMVCCInfo || this.reader.isBulkLoaded()) {
{code}
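That the suggested condition is equivalent to the patched one can be checked exhaustively. The sketch below uses hypothetical method names and passes the outcome of skipKVsNewerThanReadpoint() in as a plain boolean, so it only compares the branching logic and does not touch any HBase code. It assumes the simplified form returns true in its else branch, which the comment implies but does not spell out.

```java
/** Truth-table check of the two ways of writing the condition; names are illustrative. */
final class SeekConditionSketch {

    /** Branching as written in the patch under review. */
    static boolean asInPatch(boolean hasMVCCInfo, boolean bulkLoaded, boolean skipResult) {
        if (!hasMVCCInfo && bulkLoaded) {
            return skipResult;
        } else {
            return !hasMVCCInfo ? true : skipResult;
        }
    }

    /** Branching as suggested in the review (else branch assumed to return true). */
    static boolean asSuggested(boolean hasMVCCInfo, boolean bulkLoaded, boolean skipResult) {
        if (hasMVCCInfo || bulkLoaded) {
            return skipResult;
        }
        return true;
    }

    /** True when both forms agree for every combination of the three inputs. */
    static boolean equivalentForAllInputs() {
        boolean[] values = {false, true};
        for (boolean hasMVCCInfo : values) {
            for (boolean bulkLoaded : values) {
                for (boolean skipResult : values) {
                    if (asInPatch(hasMVCCInfo, bulkLoaded, skipResult)
                            != asSuggested(hasMVCCInfo, bulkLoaded, skipResult)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("equivalent: " + equivalentForAllInputs());
    }
}
```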
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105659#comment-14105659 ]
Jeffrey Zhong commented on HBASE-11591:
[~ram_krish]
{quote}
I think the change in HRegion.java to set the write Sequence number is a bug fix? I still feel the patch would cause issue in the following scenario after the above changes
- Assume a scan started and the read point is 20 at that time
- Bulk load is just getting completed and the scanner heap gets reset. The new bulk loaded file with seqId 22 (for eg) gets added now to the scanner heap. But remember that the read point is still 20.
- After this change we would just set the bulk load file's seqId to all its KVs which is 22.
- Because there is no mvcc info in this bulk loaded file the scan would not be able to skipTheKvsWithNewerReadPt() and hence the scan would still see the Kvs with 22 as the seqId though the intention is to see only KVs with seqID 20.
I may be wrong. Am I missing something here? I may be wrong because for bulk loaded files because there is no mvcc we are allowed to read anything in that irrespective of the read pt?
{quote}
The situation above is valid. However, with the existing behavior (as in 0.98), we allow a scan with a lower read point to read a bulk loaded file immediately, because we can load an HFile atomically. I think it's fine either to keep the existing behavior or to add handling for such cases. Another option for handling this case is to set hasMVCCInfo to true for a bulk loaded file, because we will set its KVs' mvcc from the HFile seqId.
[~jerryhe]
{quote}It can be used backup HBase data and restore.{quote}
For such a case, trunk code can handle it, but you need to keep deleted cells' mvcc forever (using the config hbase.hstore.compaction.keep.mvcc.period). When you load an old HFile, its KVs will be sorted correctly based on their mvcc values (LogSeqId).
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106424#comment-14106424 ]
ramkrishna.s.vasudevan commented on HBASE-11591:
[~jeffreyz]
bq. I think it's fine either keeping existing behavior or add handling for such cases
Then I think the above change is not necessary, and we should only handle the case from the initial patch, i.e. the case of identical KVs. The later attached patches somehow try to make mvcc accept the bulk loaded files and make them visible, just because we are setting the bulk load file's seq id on the KVs from the bulk loaded file. I think if we have to maintain the existing behaviour, then handling only the case of identical KVs should be fine. If not, the other changes are necessary. This is my take on this; please feel free to correct me. But thinking of cases like the one [~jerryhe] raised, this change is right, and for that we need to handle all cases. What do others think?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103539#comment-14103539 ]
ramkrishna.s.vasudevan commented on HBASE-11591:
So instead of setting it in the comparator, you would set it when the KV is retrieved. Should we really do this here
{code}
setCurrentCell(KeyValueUtil.createLastOnRowCol(kv));
{code}
and
{code}
setCurrentCell(KeyValueUtil.createFirstOnRowColTS(kv, maxTimestampInFile));
{code}
This is a fake key that we are creating, right?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104134#comment-14104134 ]
ramkrishna.s.vasudevan commented on HBASE-11591:
[~saint@gmail.com]
bq. A marker Interface that allows you set sequence id on the hosting object seems fine. MutableCell is a little ugly since it tarnishes our nice 'Cell' notion.
I did not see this comment earlier. Please see HBASE-11777.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104311#comment-14104311 ]

Jeffrey Zhong commented on HBASE-11591:
---
[~jerryhe]
{quote}If it is bulkloaded file, could we just set the cells regardless of its old seqId in the cell?{quote}
Yes, we could. The condition is there to prevent a Cell from being reset repeatedly.

[~ram_krish]
{quote}This is a fake key that we are creating right?{quote}
Yes, you're right that we don't have to use setCurrentCell in these two cases. The patch uses one consistent way to set the instance variable cur, so that the code is easier to maintain and reason about in the future, or if we later do more in the setCurrentCell call. I guess there is not much difference either way.

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---
Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0
Attachments: HBASE-11591.patch, HBASE-11591_1.patch, HBASE-11591_2.patch, HBASE-11591_3.patch, TestBulkload.java, hbase-11591-03-jeff.patch

See discussion in HBASE-11339.
Consider the case where the same KVs exist in two files, one produced by flush/compaction and the other through bulk load. Both files contain some identical KVs, matching even in timestamp.
Steps:
Add some rows with a specific timestamp and flush them.
Bulk load a file with the same data. Ensure that the assign seqnum property is set.
The bulk load should use HFileOutputFormat2 (or ensure that we write the bulk_time_output key).
This ensures that the bulk loaded file has the highest seq num.
Assume the cell in the flushed/compacted store file is
row1,cf,cq,ts1,value1
and the cell in the bulk loaded file is
row1,cf,cq,ts1,value2
(There are no parallel scans.)
Issue a scan on the table in 0.96. The retrieved value is
row1,cf,cq,ts1,value2
But the same scan in 0.98 will retrieve row1,cf,cq,ts1,value1.
This is a behaviour change. It is caused by this code:
{code}
public int compare(KeyValueScanner left, KeyValueScanner right) {
  int comparison = compare(left.peek(), right.peek());
  if (comparison != 0) {
    return comparison;
  } else {
    // Since both the keys are exactly the same, we break the tie in favor
    // of the key which came latest.
    long leftSequenceID = left.getSequenceID();
    long rightSequenceID = right.getSequenceID();
    if (leftSequenceID > rightSequenceID) {
      return -1;
    } else if (leftSequenceID < rightSequenceID) {
      return 1;
    } else {
      return 0;
    }
  }
}
{code}
In the 0.96 case the mvcc of the cell in both files is 0, so the comparison falls through to the else branch. There the seq id of the bulk loaded file is greater, so it sorts first, ensuring that the scan reads from the bulk loaded file.
In 0.98+ we retain the mvcc+seqid, so the mvcc is not reset to 0 (it remains a non-zero positive value). Hence compare() sorts the cell in the flushed/compacted file first. This means that although we know the latest file is the bulk loaded file, we don't scan its data.
This seems to be a behaviour change. I will check other corner cases as well, but we are trying to understand the behaviour of bulk load because we are evaluating whether it can be used for the MOB design.

--
This message was sent by Atlassian JIRA (v6.2#6252)
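The ordering described in the issue can be reproduced with a small standalone sketch. This is not HBase code: ScannerTieBreakSketch, SketchScanner and winner are hypothetical names, and the sketch assumes (as the description implies) that the cell-level comparison puts the higher per-cell mvcc/seqId first before the file-level sequence id is ever consulted.

```java
/** Standalone sketch of the tie-break described above; not HBase code. */
final class ScannerTieBreakSketch {

    /** Hypothetical stand-in for a scanner positioned on a single cell. */
    static final class SketchScanner {
        final String key;      // row/cf/cq/ts collapsed into one string
        final long cellSeqId;  // per-cell mvcc/seqId (0 when it is not retained)
        final long fileSeqId;  // sequence id of the store file being scanned
        final String value;

        SketchScanner(String key, long cellSeqId, long fileSeqId, String value) {
            this.key = key;
            this.cellSeqId = cellSeqId;
            this.fileSeqId = fileSeqId;
            this.value = value;
        }
    }

    /** Negative result means the left scanner's cell is returned first. */
    static int compare(SketchScanner left, SketchScanner right) {
        int c = left.key.compareTo(right.key);
        if (c == 0) {
            // Assumed cell-level rule: the higher per-cell seqId sorts first.
            c = Long.compare(right.cellSeqId, left.cellSeqId);
        }
        if (c != 0) {
            return c;
        }
        // Only on a complete tie does the newer file win.
        return Long.compare(right.fileSeqId, left.fileSeqId);
    }

    /** Value of whichever scanner sorts first. */
    static String winner(SketchScanner a, SketchScanner b) {
        return compare(a, b) <= 0 ? a.value : b.value;
    }

    public static void main(String[] args) {
        SketchScanner bulk = new SketchScanner("row1/cf/cq/ts1", 0L, 9L, "value2");

        // 0.96-like case: the flushed cell's mvcc is 0, so the file seq id decides.
        SketchScanner flushed096 = new SketchScanner("row1/cf/cq/ts1", 0L, 5L, "value1");
        System.out.println(winner(flushed096, bulk)); // value2

        // 0.98-like case: the flushed cell keeps a positive mvcc and sorts first.
        SketchScanner flushed098 = new SketchScanner("row1/cf/cq/ts1", 5L, 5L, "value1");
        System.out.println(winner(flushed098, bulk)); // value1
    }
}
```

In the second case the file-level branch is never reached, which is why the newer bulk loaded file loses even though its file sequence id is higher.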
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104542#comment-14104542 ]

Jerry He commented on HBASE-11591:
--
Hi, [~jeffreyz]
Regarding the cur.getSequenceId() <= 0 condition again, it is possible that the cells in the original bulk load hfiles have seqId > 0. In this case, we also need to reset them to the new file-level seqId. Right?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104559#comment-14104559 ]

Jeffrey Zhong commented on HBASE-11591:
---
{quote}
it is possible that the cells in the original bulk load hfiles have seqId > 0.
{quote}
For bulk loaded files, this should not happen.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104575#comment-14104575 ]

Jerry He commented on HBASE-11591:
--
Hi, [~jeffreyz]
There are such use cases. Please see HBASE-11772.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104703#comment-14104703 ]

Jeffrey Zhong commented on HBASE-11591:
---
I see. In the HBASE-11772 situation, it's possible. I'm not sure what the use scenario is for loading a native hfile directly, and more work is needed to make that work, though. We can take the condition cur.getSequenceId() <= 0 out here, or we can take it out in the patch for HBASE-11772.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104774#comment-14104774 ]

Jerry He commented on HBASE-11591:
--
It can be used to back up HBase data and restore it.
Either way is fine. I will pick up the work in HBASE-11772 on top of whatever is done here.
Thanks!
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104973#comment-14104973 ]

Hadoop QA commented on HBASE-11591:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12663192/hbase-11591-03-02.patch
against trunk revision .
ATTACHMENT ID: 12663192

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 7 warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10513//console

This message is automatically generated.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102535#comment-14102535 ]

stack commented on HBASE-11591:
---
There is the SequenceNumber Interface, but that is only about getting a SequenceNumber.
Like you fellows, I don't think we need to add a method to Cell. There are no setters in Cell currently. Why start now? A marker interface that allows you to set the sequence id on the hosting object seems fine. MutableCell is a little ugly since it tarnishes our nice 'Cell' notion. What about adding a setter on SequenceNumber? One of the implementors is HLogKey. It has a:

  void setLogSeqNum(final long sequence) {
    this.logSeqNum = sequence;
    this.seqNumAssignedLatch.countDown();
  }
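The "setter on a separate interface" idea discussed in this comment could look roughly like the sketch below. All names here (SettableSeqIdSketch, ReadOnlyCell, SettableSequenceId, MutableCellImpl, trySetSequenceId) are hypothetical stand-ins and not the interfaces named in the thread; the only point is that the read-only view stays setter-free while the hosting object opts in to being re-stamped.

```java
/** Standalone sketch of a setter-bearing side interface; not HBase code. */
final class SettableSeqIdSketch {

    /** Hypothetical read-only view, standing in for a getter-only Cell. */
    interface ReadOnlyCell {
        long getSequenceId();
    }

    /** Hypothetical server-side interface that carries only the setter. */
    interface SettableSequenceId {
        void setSequenceId(long seqId);
    }

    /** Hypothetical concrete cell that opts in to having its seqId changed. */
    static final class MutableCellImpl implements ReadOnlyCell, SettableSequenceId {
        private long seqId;

        MutableCellImpl(long seqId) {
            this.seqId = seqId;
        }

        @Override
        public long getSequenceId() {
            return seqId;
        }

        @Override
        public void setSequenceId(long seqId) {
            this.seqId = seqId;
        }
    }

    /** Sets the seqId only when the hosting object exposes the setter. */
    static boolean trySetSequenceId(ReadOnlyCell cell, long seqId) {
        if (cell instanceof SettableSequenceId) {
            ((SettableSequenceId) cell).setSequenceId(seqId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ReadOnlyCell settable = new MutableCellImpl(0L);
        ReadOnlyCell fixed = () -> 3L; // a cell type without the setter

        System.out.println(trySetSequenceId(settable, 42L)); // true
        System.out.println(settable.getSequenceId());        // 42
        System.out.println(trySetSequenceId(fixed, 42L));    // false
        System.out.println(fixed.getSequenceId());           // 3
    }
}
```

A caller in the read path would then need a fallback for cell types that do not implement the setter, which is the part the thread leaves open.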
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103329#comment-14103329 ]

Hadoop QA commented on HBASE-11591:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12662879/hbase-11591-03-jeff.patch
against trunk revision .
ATTACHMENT ID: 12662879

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 7 warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.hadoop.hbase.regionserver.wal.TestSecureWALReplay
org.apache.hadoop.hbase.regionserver.wal.TestWALReplay
org.apache.hadoop.hbase.regionserver.wal.TestWALReplayCompressed
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at org.apache.hadoop.hbase.mapreduce.TestRowCounter.testRowCounterHiddenColumn(TestRowCounter.java:137)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10491//console

This message is automatically generated.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103405#comment-14103405 ]

Jerry He commented on HBASE-11591:
--

Looking at this JIRA and the patches, I think it would benefit HBASE-11772 as well, particularly the idea of setting the mvcc (seqId) in the cell to the seqId of the bulk loaded file. Looks good!

[~jeffreyz]:
{code}
+  protected void setCurrentCell(Cell newVal) {
+    this.cur = newVal;
+    if(this.cur != null && this.reader.isBulkLoaded() && cur.getSequenceId() <= 0) {
+      KeyValue curKV = KeyValueUtil.ensureKeyValue(cur);
+      curKV.setSequenceId(this.reader.getSequenceID());
+      cur = curKV;
+    }
+  }
{code}
You have cur.getSequenceId() <= 0 in the if condition. If it is a bulk loaded file, could we just set the seqId on the cells regardless of the old seqId in the cell?

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---

Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0
Attachments: HBASE-11591.patch, HBASE-11591_1.patch, HBASE-11591_2.patch, HBASE-11591_3.patch, TestBulkload.java, hbase-11591-03-jeff.patch

See the discussion in HBASE-11339.
Consider the case where the same KVs exist in two files, one produced by flush/compaction and the other through bulk load. Some KVs in the two files match even in timestamp.

Steps:
Add some rows with a specific timestamp and flush them.
Bulk load a file with the same data. Ensure that the assign seqnum property is set.
The bulk load should use HFileOutputFormat2 (or ensure that we write the bulk_time_output key).
This ensures that the bulk loaded file has the highest seq num.

Assume the cell in the flushed/compacted store file is row1,cf,cq,ts1,value1 and the cell in the bulk loaded file is row1,cf,cq,ts1,value2 (there are no parallel scans).
Issue a scan on the table in 0.96. The retrieved value is row1,cf,cq,ts1,value2.
But the same scan in 0.98 will retrieve row1,cf,cq,ts1,value1.
This is a behaviour change. It is caused by this code:
{code}
public int compare(KeyValueScanner left, KeyValueScanner right) {
  int comparison = compare(left.peek(), right.peek());
  if (comparison != 0) {
    return comparison;
  } else {
    // Since both the keys are exactly the same, we break the tie in favor
    // of the key which came latest.
    long leftSequenceID = left.getSequenceID();
    long rightSequenceID = right.getSequenceID();
    if (leftSequenceID > rightSequenceID) {
      return -1;
    } else if (leftSequenceID < rightSequenceID) {
      return 1;
    } else {
      return 0;
    }
  }
}
{code}
In the 0.96 case the mvcc of the cell is 0 in both files, so the comparison falls through to the else branch. There the seq id of the bulk loaded file is greater, so its scanner sorts first and the scan reads from the bulk loaded file.
In 0.98+ we retain the mvcc+seqid, so the mvcc of the flushed cell is not reset to 0 (it remains a non-zero positive value). Hence compare() sorts the cell in the flushed/compacted file first. This means that although we know the latest file is the bulk loaded file, we do not return its data.
This seems to be a behaviour change.
I will check other corner cases as well, but we are trying to understand the behaviour of bulk load because we are evaluating whether it can be used for the MOB design.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
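For readers following the thread, the ordering described in the issue can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions: the Cell and Scanner classes below are hypothetical stand-ins, not the HBase types, and the cell comparison is reduced to "key ascending, then mvcc descending". The numbers 3, 5 and 10 are made up for illustration.

```java
public class TieBreakSketch {
  /** Hypothetical simplified cell: the whole key as one string, plus a per-cell mvcc. */
  public static final class Cell {
    final String key;
    final long mvcc;
    final String value;

    public Cell(String key, long mvcc, String value) {
      this.key = key;
      this.mvcc = mvcc;
      this.value = value;
    }
  }

  /** Hypothetical scanner over one store file: its current cell and the file-level seq id. */
  public static final class Scanner {
    final Cell top;
    final long fileSeqId;

    public Scanner(Cell top, long fileSeqId) {
      this.top = top;
      this.fileSeqId = fileSeqId;
    }
  }

  /** Cell comparison: key ascending, then mvcc descending (assumed simplification). */
  static int compareCells(Cell a, Cell b) {
    int c = a.key.compareTo(b.key);
    if (c != 0) {
      return c;
    }
    return Long.compare(b.mvcc, a.mvcc);
  }

  /** Mirrors the quoted compare(): the file seq id is consulted only on an exact tie. */
  static int compareScanners(Scanner left, Scanner right) {
    int c = compareCells(left.top, right.top);
    if (c != 0) {
      return c;
    }
    return Long.compare(right.fileSeqId, left.fileSeqId);
  }

  /** Value of the cell that would be surfaced first by a heap using compareScanners. */
  public static String winner(Scanner a, Scanner b) {
    return compareScanners(a, b) <= 0 ? a.top.value : b.top.value;
  }

  public static void main(String[] args) {
    Scanner bulk = new Scanner(new Cell("row1/cf:cq/ts1", 0, "value2"), 10);

    // 0.96-like: the flushed cell's mvcc is 0, so the file seq id decides.
    Scanner flushed096 = new Scanner(new Cell("row1/cf:cq/ts1", 0, "value1"), 5);
    System.out.println(winner(flushed096, bulk)); // value2

    // 0.98-like: the flushed cell keeps a positive mvcc and wins at the cell level.
    Scanner flushed098 = new Scanner(new Cell("row1/cf:cq/ts1", 3, "value1"), 5);
    System.out.println(winner(flushed098, bulk)); // value1
  }
}
```

The second case never reaches the file-level tie-break, which is why the higher seq id of the bulk loaded file has no effect there.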
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100310#comment-14100310 ]

Anoop Sam John commented on HBASE-11591:

Sure. Some quick comments after a glance at the patch:

isBulkLoadResult -> isBulkLoaded()? For the setter also?
I see this isBulkLoadResult() at the StoreFile.java level also. It would have been better to get this status from StoreFile rather than from StoreFileReader.

Also, what about compacting a flushed file and a bulk loaded one? Will we have issues then? Will this patch handle that also? Mind adding tests around that as well.

compareWithoutMvcc(Cell left, Cell right)
We have now deprecated the *mvcc() methods. Suggest changing the name here also.

bq.// TODO : While doing cells this is should be avoided in the read path.
IMHO we should not do this KeyValueUtil.ensureKeyValue() stuff from now on (mainly in the read path). In the near future we will want Cells in the read path. How can we solve this particular issue then? (We cannot add a setter in Cell.java, I believe.) Or do we need an extension interface for Cell *on the server side* which has the setter?

Doing a deeper look, Ram. Sorry for being late.
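The "extension interface on the server side" idea raised above could look roughly like the following. This is a hedged sketch only: ReadOnlyCell, SeqIdSettable and BasicCell are hypothetical names invented for illustration and are not claimed to be part of HBase; the point is just that the setter lives on a narrower interface than the public cell view.

```java
public class SeqIdSetterSketch {
  /** Hypothetical read-only cell view, standing in for the public Cell interface. */
  public interface ReadOnlyCell {
    String key();
    long getSequenceId();
  }

  /** Hypothetical server-side extension that adds the setter without touching ReadOnlyCell. */
  public interface SeqIdSettable extends ReadOnlyCell {
    void setSequenceId(long seqId);
  }

  /** Minimal server-side implementation used only for this illustration. */
  public static final class BasicCell implements SeqIdSettable {
    private final String key;
    private long seqId;

    public BasicCell(String key, long seqId) {
      this.key = key;
      this.seqId = seqId;
    }

    @Override
    public String key() {
      return key;
    }

    @Override
    public long getSequenceId() {
      return seqId;
    }

    @Override
    public void setSequenceId(long seqId) {
      this.seqId = seqId;
    }
  }

  /**
   * Stamps the file-level seq id on the cell if it exposes the server-side setter.
   * Returns false when the cell cannot be stamped, so the caller can decide on a fallback.
   */
  public static boolean stamp(ReadOnlyCell cell, long fileSeqId) {
    if (cell instanceof SeqIdSettable) {
      ((SeqIdSettable) cell).setSequenceId(fileSeqId);
      return true;
    }
    return false;
  }
}
```

With this shape the read path would not need to convert every cell into a KeyValue just to call a setter; whether that fits the real class hierarchy is a question for the follow-up issue.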
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100361#comment-14100361 ]

Anoop Sam John commented on HBASE-11591:

{code}
+      if(bulkLoad) {
+        // TODO : While doing cells this is should be avoided in the read path.
+        KeyValue leftKV = KeyValueUtil.ensureKeyValue(left.peek());
+        KeyValue rightKV = KeyValueUtil.ensureKeyValue(right.peek());
+        if(leftKV.getSequenceId() == 0) {
+          leftKV.setSequenceId(rightKV.getSequenceId());
+        } else {
+          rightKV.setSequenceId(leftKV.getSequenceId());
+        }
+      }
{code}
So what do we do here, Ram? I think we need to set the KV seqId for KVs from a bulk loaded file to the file seqId (which we get from that file name).
So instead of setting the seqId of one KV to that of the other (which looks hacky IMO), can we set it to the seqId of the file?
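The alternative suggested above (use the file's seqId rather than copying one KV's seqId to the other) can be sketched as follows. Assumptions, stated loudly: the sketch supposes that a bulk loaded file name carries a marker of the form _SeqId_<n>_, and the helper names seqIdFromFileName and effectiveSeqId are hypothetical; neither is presented as the patch's actual code.

```java
public class BulkLoadSeqIdSketch {
  // Assumed marker in the bulk loaded file name, e.g. "abc123_SeqId_42_".
  private static final String MARKER = "_SeqId_";

  /** Returns the number following the marker, or -1 if the name has no parsable marker. */
  public static long seqIdFromFileName(String fileName) {
    int start = fileName.indexOf(MARKER);
    if (start < 0) {
      return -1L;
    }
    int from = start + MARKER.length();
    int end = fileName.indexOf('_', from);
    if (end < 0) {
      return -1L;
    }
    try {
      return Long.parseLong(fileName.substring(from, end));
    } catch (NumberFormatException e) {
      return -1L;
    }
  }

  /**
   * Seq id to use for a cell when ordering it against other files:
   * cells from a bulk loaded file take the file's seq id, all others keep their own.
   */
  public static long effectiveSeqId(long cellSeqId, boolean bulkLoaded, long fileSeqId) {
    return bulkLoaded ? fileSeqId : cellSeqId;
  }
}
```

Because every cell from the bulk loaded file gets the same file-level value, the result does not depend on which other KV it happens to be compared with, which is the property the pairwise copy in the quoted patch lacks.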
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100372#comment-14100372 ]

Hadoop QA commented on HBASE-11591:
---

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12662425/HBASE-11591_2.patch
against trunk revision .
ATTACHMENT ID: 12662425

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10473//console

This message is automatically generated.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100384#comment-14100384 ]

ramkrishna.s.vasudevan commented on HBASE-11591:

I got a clean QA run.

bq.isBulkLoadResult - isBulkLoaded()? For setter also?
Okay, fine with that.

bq.I see this isBulkLoadResult () in StoreFile.java level also. I would have been better to know this status from StoreFile rather than from StoreFileReader.
I spent some time trying it that way and later decided on the current approach. First, only the reader is passed to the StoreFileScanner, and the StoreFileScanner only has a reader associated with it. So to get this information from StoreFile I would need to change the constructor of StoreFileScanner or use a setter, and I thought that would make the patch heavier.
Also, in this case the bulk load information has to come from the reader (because the reader reads the file info) and then be set on the StoreFile. Currently the reader is also an inner class of StoreFile. Considering all this, I just kept the new getter/setter at the Reader level.

bq.compareWithoutMvcc
Okay.

bq.IMHO we should not do this KeyValueUtil.ensureKeyValue() stuff from now
Yes, but I think we should handle that in a separate JIRA, in fact, to find a way to do this setSeqId without going through KeyValueUtil.ensureKeyValue().

bq.I think we need to set KV seqId for KVs, from bulk loaded file, to the file seqId
Yes. I set the other KV's sequence id because I wanted to ensure that we return one of the two contesting KVs, and that the returned KV looks like what would have been returned if there were no clash and the latest one was from the flushed file. Anyway, before changing this let me check some more cases and then update the patch accordingly. In fact, I had originally set the sequenceId of the file and later changed it to this approach.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100392#comment-14100392 ]

ramkrishna.s.vasudevan commented on HBASE-11591:

bq.Also what abt compacting a flush file and a bulk loaded one? Will we have issues then? This patch will handle that also? Mind adding tests around that also.
The current test also compacts the flushed files. Behaviour-wise both would be the same in 0.99+.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100539#comment-14100539 ]

Hadoop QA commented on HBASE-11591:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12662463/HBASE-11591_3.patch
against trunk revision .
ATTACHMENT ID: 12662463

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.hadoop.hbase.TestRegionRebalancing

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10477//console

This message is automatically generated.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100860#comment-14100860 ]

Ted Yu commented on HBASE-11591:

{code}
+   * Compares two cells without mvcc
+   *
+   * @param left
+   * @param right
+   * @return less than 0 if left is smaller, 0 if equal etc..
+   */
+  public int compareWithoutSeqId(Cell left, Cell right) {
{code}
Change the javadoc to match the method name.

Cell is marked @InterfaceStability.Evolving.
setSequenceId() should be added to the Cell interface, in another issue.
{code}
+public class TestScannerWithBulkload {
+  private final static HBaseTestingUtility TEST_UTIL = new HBaseTestingUtility();
+  private final static String tableName = "testBulkload";
{code}
Please change tableName to match the test name.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100875#comment-14100875 ]

ramkrishna.s.vasudevan commented on HBASE-11591:

bq.setSequenceId() should be added to Cell interface - in another issue.
I don't think we can add setSequenceId() to Cell. We can discuss that. I will update the patch.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101080#comment-14101080 ] Andrew Purtell commented on HBASE-11591:
bq. But setting the seqId on the read path would prevent us from using Cell based impl because Cell does not have it.
What prevents us from adding seqID accessors as an additional interface extending Cell in hbase-server as Anoop proposed above?
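The proposal to add the seqID setter through an additional server-side interface that extends Cell could look roughly like the sketch below. It is only an illustration under stated assumptions: ReadOnlyCell stands in for the client-visible Cell interface, and SequenceSettable, ServerSideCell, ExtensionSketch and stampIfSupported are hypothetical names that do not come from the thread or from HBase.

```java
/** Stand-in for the client-visible Cell interface, reduced to two getters. */
interface ReadOnlyCell {
    String getRow();

    long getSequenceId();
}

/** Hypothetical server-side extension that adds the setter Cell itself lacks. */
interface SequenceSettable extends ReadOnlyCell {
    void setSequenceId(long sequenceId);
}

/** Hypothetical server-side cell implementation that accepts a sequence id. */
final class ServerSideCell implements SequenceSettable {
    private final String row;
    private long sequenceId;

    ServerSideCell(String row) {
        this.row = row;
    }

    @Override
    public String getRow() { return row; }

    @Override
    public long getSequenceId() { return sequenceId; }

    @Override
    public void setSequenceId(long sequenceId) { this.sequenceId = sequenceId; }
}

final class ExtensionSketch {
    /** Sets the sequence id only on cells that opt in to the server-side interface. */
    static boolean stampIfSupported(ReadOnlyCell cell, long sequenceId) {
        if (cell instanceof SequenceSettable) {
            ((SequenceSettable) cell).setSequenceId(sequenceId);
            return true;
        }
        return false;
    }
}
```

The point of this shape is that the client-facing interface stays read-only, while server code can still set the sequence id on implementations that support it.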
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101818#comment-14101818 ] ramkrishna.s.vasudevan commented on HBASE-11591:
bq. Or do we need an extension interface for Cell in server side which is having the setter?
I think it is better that we do only that. Is there any better way? The idea of Cell is to create different impls of it as each case needs, like how in BufferedDataEncoders the SeekerState is a Cell now. In fact, every place where we call setSeqId() on the server side should be changed. Do it in this JIRA or another JIRA? One thing to note is that in the critical path we would anyway have code that creates instances of that new impl.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101837#comment-14101837 ] Anoop Sam John commented on HBASE-11591:
+1 for a new Jira.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101843#comment-14101843 ] Anoop Sam John commented on HBASE-11591:
HBASE-11772 related issue.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101875#comment-14101875 ] ramkrishna.s.vasudevan commented on HBASE-11591:
bq. +1 for a new Jira
OK, will raise a JIRA.
bq. HBASE-11772 related issue.
Better to see whether this patch is useful for HBASE-11772 as well.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101880#comment-14101880 ] ramkrishna.s.vasudevan commented on HBASE-11591:
Raised https://issues.apache.org/jira/browse/HBASE-11777.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100274#comment-14100274 ] ramkrishna.s.vasudevan commented on HBASE-11591:
All the test case failures have this:
{code}
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:713)
  at org.mortbay.thread.QueuedThreadPool.newThread(QueuedThreadPool.java:462)
  at org.mortbay.thread.QueuedThreadPool.doStart(QueuedThreadPool.java:403)
  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
  at org.mortbay.jetty.Server.doStart(Server.java:218)
  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
  at org.apache.hadoop.hbase.http.HttpServer.start(HttpServer.java:949)
  at org.apache.hadoop.hbase.http.InfoServer.start(InfoServer.java:78)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.putUpWebUI(HRegionServer.java:1602)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:520)
  at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.init(MiniHBaseCluster.java:115)
{code}
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100286#comment-14100286 ] ramkrishna.s.vasudevan commented on HBASE-11591:
All the failed tests are passing. Let me rerun once again for HadoopQA.
[~jeffreyz] The patch uses kv.setSequenceId() here. Please have a look. But setting the seqId on the read path would prevent us from using a Cell based impl because Cell does not have it. For now it is fine.
[~saint@gmail.com], [~apurtell], [~anoop.hbase] Want to have a look at this?
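The read-path approach mentioned in this comment, setting the sequence id on each KV read from a bulk loaded file, might be sketched as follows. This is an assumption-laden illustration and not the contents of the patch: StampedCell, BulkLoadReadSketch and readAll are hypothetical names, and a plain list stands in for a store file.

```java
import java.util.ArrayList;
import java.util.List;

/** Reduced stand-in for a KeyValue whose sequence id can be set after it is read. */
final class StampedCell {
    final String key;
    final String value;
    private long sequenceId;

    StampedCell(String key, String value, long sequenceId) {
        this.key = key;
        this.value = value;
        this.sequenceId = sequenceId;
    }

    long getSequenceId() { return sequenceId; }

    void setSequenceId(long sequenceId) { this.sequenceId = sequenceId; }
}

final class BulkLoadReadSketch {
    /**
     * Returns the cells of one file in order. Cells from a bulk loaded file are
     * stamped with the file's sequence id; cells from other files are left alone.
     */
    static List<StampedCell> readAll(List<StampedCell> cellsInFile,
                                     boolean bulkLoaded,
                                     long fileSequenceId) {
        List<StampedCell> out = new ArrayList<>(cellsInFile.size());
        for (StampedCell cell : cellsInFile) {
            if (bulkLoaded) {
                cell.setSequenceId(fileSequenceId);
            }
            out.add(cell);
        }
        return out;
    }
}
```

Once the bulk loaded cell carries its file's sequence id, a cell-level comparison that orders the larger id first would pick it over an identical flushed cell with a smaller id. The trade-off raised in the comment still applies: the cell type has to expose a setter.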
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099642#comment-14099642 ] Hadoop QA commented on HBASE-11591:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662280/HBASE-11591_1.patch against trunk revision .
ATTACHMENT ID: 12662280
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.hadoop.hbase.coprocessor.TestBigDecimalColumnInterpreter
org.apache.hadoop.hbase.coprocessor.TestMasterObserver
org.apache.hadoop.hbase.mapred.TestTableSnapshotInputFormat
org.apache.hadoop.hbase.coprocessor.TestDoubleColumnInterpreter
org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithRemove
org.apache.hadoop.hbase.replication.regionserver.TestReplicationSink
org.apache.hadoop.hbase.coprocessor.TestMasterCoprocessorExceptionWithRemove
org.apache.hadoop.hbase.io.encoding.TestLoadAndSwitchEncodeOnDisk
org.apache.hadoop.hbase.mapred.TestTableInputFormat
org.apache.hadoop.hbase.io.hfile.TestHFileBlock
org.apache.hadoop.hbase.coprocessor.TestHTableWrapper
org.apache.hadoop.hbase.coprocessor.TestCoprocessorEndpoint
org.apache.hadoop.hbase.coprocessor.TestRowProcessorEndpoint
org.apache.hadoop.hbase.coprocessor.TestRegionServerObserver
org.apache.hadoop.hbase.coprocessor.TestClassLoading
org.apache.hadoop.hbase.coprocessor.TestOpenTableInCoprocessor
org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort
org.apache.hadoop.hbase.coprocessor.TestBatchCoprocessorEndpoint
org.apache.hadoop.hbase.TestGlobalMemStoreSize
org.apache.hadoop.hbase.TestRegionRebalancing
org.apache.hadoop.hbase.TestIOFencing
org.apache.hadoop.hbase.zookeeper.TestZooKeeperACL
org.apache.hadoop.hbase.coprocessor.TestRegionObserverBypass
org.apache.hadoop.hbase.coprocessor.TestAggregateProtocol
{color:red}-1 core zombie tests{color}. There are 4 zombie test(s):
at org.apache.hadoop.hbase.io.hfile.TestScannerSelectionUsingTTL.testScannerSelection(TestScannerSelectionUsingTTL.java:128)
at org.apache.hadoop.hbase.io.encoding.TestEncodedSeekers.testEncodedSeeker(TestEncodedSeekers.java:117)
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10468//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings:
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099346#comment-14099346 ]

Jeffrey Zhong commented on HBASE-11591:
---
{quote}
Should we rewrite the KV before sending it to the StoreScanner layer so that the kv comparison works fine?
{quote}
We can set the mvcc (the function is now named setSequenceId()) when reading KVs from bulk loaded hfiles, using the hfile's sequence id. This way is cleaner, and it also solves the issue that a new hfile may contain some KVs without mvcc after compacting a normal hfile with a bulk loaded hfile.

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---
Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0
Attachments: HBASE-11591.patch, TestBulkload.java

See the discussion in HBASE-11339.
Consider the case where the same KVs exist in two files, one produced by flush/compaction and the other through bulk load. Both files contain some identical KVs that match even in timestamp.
Steps:
- Add some rows with a specific timestamp and flush them.
- Bulk load a file with the same data. Ensure that the assign seqnum property is set. The bulk load should use HFileOutputFormat2 (or ensure that we write the bulk_time_output key). This ensures that the bulk loaded file has the highest seq num.
Assume the cell in the flushed/compacted store file is row1,cf,cq,ts1,value1 and the cell in the bulk loaded file is row1,cf,cq,ts1,value2 (there are no parallel scans).
Issue a scan on the table in 0.96. The retrieved value is row1,cf,cq,ts1,value2.
But the same scan in 0.98 will retrieve row1,cf,cq,ts1,value1.
This is a behaviour change. It is because of this code:
{code}
public int compare(KeyValueScanner left, KeyValueScanner right) {
  int comparison = compare(left.peek(), right.peek());
  if (comparison != 0) {
    return comparison;
  } else {
    // Since both the keys are exactly the same, we break the tie in favor
    // of the key which came latest.
    long leftSequenceID = left.getSequenceID();
    long rightSequenceID = right.getSequenceID();
    if (leftSequenceID > rightSequenceID) {
      return -1;
    } else if (leftSequenceID < rightSequenceID) {
      return 1;
    } else {
      return 0;
    }
  }
}
{code}
In the 0.96 case the mvcc of the cell in both files is 0, so the comparison falls through to the else branch. There the seq id of the bulk loaded file is greater, so it sorts first, ensuring that the scan reads from the bulk loaded file.
In 0.98+, because we retain the mvcc+seqid, the mvcc is not reset to 0 (it remains a non-zero positive value). Hence compare() sorts the cell in the flushed/compacted file first. This means that although we know the latest file is the bulk loaded file, we don't scan its data.
This seems to be a behaviour change. I will check other corner cases as well; we are trying to understand the behaviour of bulk load because we are evaluating whether it can be used for the MOB design.

-- This message was sent by Atlassian JIRA (v6.2#6252)
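The ordering described in the issue can be reproduced with a small stand-alone sketch. Everything here is a simplified stand-in, not HBase code: `Cell`, `FileScanner`, `compareCells` and `firstValue` are hypothetical names, the file sequence ids 5 and 10 and the mvcc value 7 are made up, and the rule that a higher mvcc sorts first for identical keys is an assumption taken from the description above.

```java
// Simplified stand-ins for illustration only; these are not the real HBase classes.
final class ScannerTieBreakSketch {

    /** A cell reduced to its key, its mvcc (sequence id) and its value. */
    static final class Cell {
        final String key;
        final long mvcc;
        final String value;

        Cell(String key, long mvcc, String value) {
            this.key = key;
            this.mvcc = mvcc;
            this.value = value;
        }
    }

    /** A store file scanner reduced to its current cell and its file's sequence id. */
    static final class FileScanner {
        final Cell current;
        final long fileSequenceId;

        FileScanner(Cell current, long fileSequenceId) {
            this.current = current;
            this.fileSequenceId = fileSequenceId;
        }
    }

    // Assumption: for identical keys, the cell with the higher mvcc sorts first.
    static int compareCells(Cell a, Cell b) {
        int byKey = a.key.compareTo(b.key);
        return byKey != 0 ? byKey : Long.compare(b.mvcc, a.mvcc);
    }

    // Same shape as the compare() quoted above: the file sequence id is only a tie-breaker.
    static int compareScanners(FileScanner left, FileScanner right) {
        int comparison = compareCells(left.current, right.current);
        if (comparison != 0) {
            return comparison;
        }
        return Long.compare(right.fileSequenceId, left.fileSequenceId);
    }

    /** Returns the value a scan would see first for the given cell mvcc values. */
    static String firstValue(long flushedMvcc, long bulkMvcc) {
        String key = "row1/cf:cq/ts1";
        FileScanner flushed = new FileScanner(new Cell(key, flushedMvcc, "value1"), 5L);
        FileScanner bulk = new FileScanner(new Cell(key, bulkMvcc, "value2"), 10L);
        return compareScanners(flushed, bulk) <= 0 ? flushed.current.value : bulk.current.value;
    }

    public static void main(String[] args) {
        System.out.println("mvcc reset to 0 (0.96-style): " + firstValue(0L, 0L));
        System.out.println("mvcc retained (0.98-style):   " + firstValue(7L, 0L));
    }
}
```

With both mvcc values at 0 the tie-breaker is reached and the bulk loaded file wins (value2). Once the flushed cell keeps a non-zero mvcc, the cell comparison already decides the order and the tie-breaker is never consulted (value1).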
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098186#comment-14098186 ]

ramkrishna.s.vasudevan commented on HBASE-11591:
Yes, logically right. But the thing is, in this case it is retrieving a KV which is smaller than the previous KV just because of the mvcc of the bulk loaded file. Should we rewrite the KV before sending it to the StoreScanner layer so that the KV comparison works fine?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095928#comment-14095928 ]

ramkrishna.s.vasudevan commented on HBASE-11591:
Any thoughts here?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096324#comment-14096324 ]

Jeffrey Zhong commented on HBASE-11591:
---
The patch looks good to me (+1)! Basically it restores the old behavior for bulk loaded files by treating all KVs involved as having mvcc==0. Thanks.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080548#comment-14080548 ]

ramkrishna.s.vasudevan commented on HBASE-11591:
The other test case seems to fail because of some env issues.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080547#comment-14080547 ]

ramkrishna.s.vasudevan commented on HBASE-11591:
Not sure about the other test case failures, but the newly added test case TestScannerWithBulkload fails here:
{code}
protected void checkScanOrder(Cell prevKV, Cell kv, KeyValue.KVComparator comparator) throws IOException {
  // Check that the heap gives us KVs in an increasing order.
  assert prevKV == null || comparator == null || comparator.compare(prevKV, kv) <= 0 :
      "Key " + prevKV + " followed by a " + "smaller key " + kv + " in cf " + store;
}
{code}
So can we remove that assertion? This change is becoming trickier.
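One plausible reading of why the order check trips, sketched under assumptions: if the scan emits the bulk loaded cell first (because its file is newer) while that cell still carries mvcc 0, the flushed cell that follows compares as smaller under a comparator that puts higher mvcc first. The class and method names below are hypothetical, the mvcc values are made up, and `inOrder` only mirrors the shape of the quoted assertion.

```java
// Illustration only; simplified stand-ins rather than the real HBase types.
final class ScanOrderCheckSketch {

    /** A cell reduced to its key and its mvcc (sequence id). */
    static final class Cell {
        final String key;
        final long mvcc;

        Cell(String key, long mvcc) {
            this.key = key;
            this.mvcc = mvcc;
        }
    }

    // Assumption: for identical keys, the cell with the higher mvcc sorts first.
    static int compareCells(Cell a, Cell b) {
        int byKey = a.key.compareTo(b.key);
        return byKey != 0 ? byKey : Long.compare(b.mvcc, a.mvcc);
    }

    // Same condition as the quoted assertion, returned as a boolean instead of asserted.
    static boolean inOrder(Cell prev, Cell next) {
        return prev == null || compareCells(prev, next) <= 0;
    }

    /** Checks the sequence "bulk loaded cell, then flushed cell" for the given mvcc values. */
    static boolean bulkThenFlushedInOrder(long bulkMvcc, long flushedMvcc) {
        String key = "row1/cf:cq/ts1";
        return inOrder(new Cell(key, bulkMvcc), new Cell(key, flushedMvcc));
    }

    public static void main(String[] args) {
        // Bulk loaded cell left at mvcc 0: the flushed cell (mvcc 7) looks "smaller".
        System.out.println("bulk mvcc 0:  " + bulkThenFlushedInOrder(0L, 7L));
        // Bulk loaded cell carrying a higher sequence id: the order check passes.
        System.out.println("bulk mvcc 10: " + bulkThenFlushedInOrder(10L, 7L));
    }
}
```

The first call reports an out-of-order pair and the second does not, which suggests that adjusting the sequence id seen by the comparator is an alternative to removing the assertion.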
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080549#comment-14080549 ]

ramkrishna.s.vasudevan commented on HBASE-11591:
{code}
Error: java.lang.NullPointerException
  at org.apache.hadoop.hbase.mapreduce.LabelExpander.getLabelOrdinals(LabelExpander.java:129)
  at org.apache.hadoop.hbase.mapreduce.LabelExpander.getLabelOrdinals(LabelExpander.java:145)
  at org.apache.hadoop.hbase.mapreduce.LabelExpander.createVisibilityTags(LabelExpander.java:105)
  at org.apache.hadoop.hbase.mapreduce.LabelExpander.createKVFromCellVisibilityExpr(LabelExpander.java:217)
  at org.apache.hadoop.hbase.mapreduce.TsvImporterMapper.createPuts(TsvImporterMapper.java:195)
  at org.apache.hadoop.hbase.mapreduce.TsvImporterMapper.map(TsvImporterMapper.java:153)
  at org.apache.hadoop.hbase.mapreduce.TsvImporterMapper.map(TsvImporterMapper.java:39)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
{code}
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080367#comment-14080367 ]

Hadoop QA commented on HBASE-11591:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658788/HBASE-11591.patch against trunk revision .
ATTACHMENT ID: 12658788

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified tests.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
  org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction
  org.apache.hadoop.hbase.migration.TestNamespaceUpgrade
  org.apache.hadoop.hbase.regionserver.TestScannerWithBulkload
  org.apache.hadoop.hbase.master.TestMasterOperationsForRegionReplicas
  org.apache.hadoop.hbase.regionserver.TestRegionReplicas
  org.apache.hadoop.hbase.mapreduce.TestImportTSVWithVisibilityLabels
  org.apache.hadoop.hbase.client.TestReplicasClient
  org.apache.hadoop.hbase.master.TestRestartCluster
  org.apache.hadoop.hbase.TestIOFencing
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10236//console

This message is automatically generated.
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076936#comment-14076936 ]

ramkrishna.s.vasudevan commented on HBASE-11591:
[~saint@gmail.com],[~stack],[~jeffreyz] Want to take a look at this? Should the KVScannerComparator take into consideration, before comparing the cell, whether the cell comes from a bulk loaded file?
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077290#comment-14077290 ]

Jeffrey Zhong commented on HBASE-11591:
---
I think this can be solved when reading bulk loaded hfiles: we can use the current hfile's sequenceId as its KVs' mvcc values. The other option is to add new metadata to the hfile as a default mvcc value for situations like bulk loaded hfiles. Thanks.

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---
Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0
Attachments: TestBulkload.java

See discussion in HBASE-11339. Consider the case where the same KVs exist in two files, one produced by flush/compaction and the other through bulk load. Both files have some identical KVs, matching even in timestamp.

Steps: Add some rows with a specific timestamp and flush them. Bulk load a file with the same data. Ensure that the assign seqnum property is set. The bulk load should use HFileOutputFormat2 (or ensure that we write the bulk_time_output key). This ensures that the bulk loaded file has the highest seq num.

Assume the cell in the flushed/compacted store file is row1,cf,cq,ts1,value1 and the cell in the bulk loaded file is row1,cf,cq,ts1,value2 (there are no parallel scans). Issue a scan on the table in 0.96: the retrieved value is row1,cf,cq,ts1,value2. The same scan in 0.98 retrieves row1,cf,cq,ts1,value1. This is a behaviour change.

This is because of this code:
{code}
public int compare(KeyValueScanner left, KeyValueScanner right) {
  int comparison = compare(left.peek(), right.peek());
  if (comparison != 0) {
    return comparison;
  } else {
    // Since both the keys are exactly the same, we break the tie in favor
    // of the key which came latest.
    long leftSequenceID = left.getSequenceID();
    long rightSequenceID = right.getSequenceID();
    if (leftSequenceID > rightSequenceID) {
      return -1;
    } else if (leftSequenceID < rightSequenceID) {
      return 1;
    } else {
      return 0;
    }
  }
}
{code}
In the 0.96 case the mvcc of the cell is 0 in both files, so the comparison falls through to the else branch. There the seq id of the bulk loaded file is greater, so it sorts first, ensuring that the scan reads from the bulk loaded file. In 0.98+ we retain the mvcc+seqid and do not reset the mvcc to 0 (it remains a non-zero positive value). Hence compare() sorts the cell in the flushed/compacted file first, which means that although we know the latest file is the bulk loaded file, we don't scan its data. This seems to be a behaviour change. Will check other corner cases as well, but we are trying to understand the behaviour of bulk load because we are evaluating whether it can be used for the MOB design.
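The behaviour change described in the issue can be reproduced with a small, self-contained model. This is only a sketch: `Cell`, `FileScanner`, `compareCells` and the concrete mvcc and sequence id numbers are hypothetical stand-ins, not HBase classes, and it assumes the cell-level comparison orders equal keys by higher mvcc first and that cells in the bulk loaded file carry an mvcc of 0.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class TieBreakDemo {

    /** Hypothetical stand-in for a cell: key, per-cell mvcc, value. */
    static final class Cell {
        final String key;
        final long mvcc;
        final String value;

        Cell(String key, long mvcc, String value) {
            this.key = key;
            this.mvcc = mvcc;
            this.value = value;
        }
    }

    /** Hypothetical stand-in for a scanner positioned on one store file. */
    static final class FileScanner {
        final Cell head;
        final long sequenceId;

        FileScanner(Cell head, long sequenceId) {
            this.head = head;
            this.sequenceId = sequenceId;
        }

        Cell peek() { return head; }

        long getSequenceID() { return sequenceId; }
    }

    // Assumed cell order: key first, then the higher mvcc sorts first.
    static int compareCells(Cell a, Cell b) {
        int c = a.key.compareTo(b.key);
        return c != 0 ? c : Long.compare(b.mvcc, a.mvcc);
    }

    // Same shape as the comparator quoted in the issue: the file-level
    // sequence id is consulted only when the cells compare as equal.
    static final Comparator<FileScanner> SCANNER_ORDER = (left, right) -> {
        int comparison = compareCells(left.peek(), right.peek());
        if (comparison != 0) {
            return comparison;
        }
        return Long.compare(right.getSequenceID(), left.getSequenceID());
    };

    /** Returns the value a scan would see first for the duplicated key. */
    static String firstValue(long flushedCellMvcc) {
        PriorityQueue<FileScanner> heap = new PriorityQueue<>(SCANNER_ORDER);
        // Flushed/compacted file: older file-level sequence id (5).
        heap.add(new FileScanner(new Cell("row1/cf:cq/ts1", flushedCellMvcc, "value1"), 5L));
        // Bulk loaded file: newest file-level sequence id (10), cell mvcc 0.
        heap.add(new FileScanner(new Cell("row1/cf:cq/ts1", 0L, "value2"), 10L));
        return heap.peek().peek().value;
    }

    public static void main(String[] args) {
        System.out.println("mvcc zeroed (0.96-like):   " + firstValue(0L));
        System.out.println("mvcc retained (0.98-like): " + firstValue(4L));
    }
}
```

With the mvcc zeroed the tie falls through to the sequence id and the bulk loaded `value2` wins; with a non-zero mvcc retained on the flushed cell, the cell comparison already decides the order and `value1` is returned, so the file-level sequence id is never consulted.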
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077389#comment-14077389 ]

ramkrishna.s.vasudevan commented on HBASE-11591:
bq. we can use current hfile sequenceId as its KVs' mvcc values.
Ya this is what I too felt. But to do that we may need to add some methods and ensure that the KeyValueHeap is of type StoreFileScanner and the reader on that is of bulk loaded file. Will post a patch soon.

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---
Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0
Attachments: TestBulkload.java
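The idea discussed in this comment, using the hfile's sequence id as the mvcc of cells read from a bulk loaded file, could look roughly like the sketch below. It is not the patch attached to the issue: `Cell`, `onRead`, `compareCells` and the numbers used are hypothetical, and the cell ordering (equal keys sorted by higher mvcc first) is an assumption made for illustration.

```java
class BulkLoadSeqIdSketch {

    /** Hypothetical stand-in for a cell: key, per-cell mvcc, value. */
    static final class Cell {
        final String key;
        final long mvcc;
        final String value;

        Cell(String key, long mvcc, String value) {
            this.key = key;
            this.mvcc = mvcc;
            this.value = value;
        }
    }

    // Hypothetical reader-side hook: cells coming from a bulk loaded file
    // are stamped with that file's sequence id; other cells pass through.
    static Cell onRead(Cell stored, boolean fromBulkLoadedFile, long fileSequenceId) {
        return fromBulkLoadedFile
            ? new Cell(stored.key, fileSequenceId, stored.value)
            : stored;
    }

    // Assumed cell order: key first, then the higher mvcc sorts first.
    static int compareCells(Cell a, Cell b) {
        int c = a.key.compareTo(b.key);
        return c != 0 ? c : Long.compare(b.mvcc, a.mvcc);
    }

    /** Returns the value that sorts first, with or without the stamping. */
    static String firstValue(boolean stampBulkLoadedCells) {
        // Flushed cell keeps its non-zero mvcc (4); its file has sequence id 5.
        Cell flushed = onRead(new Cell("row1/cf:cq/ts1", 4L, "value1"), false, 5L);
        // Bulk loaded cell is stored with mvcc 0; its file has sequence id 10.
        Cell bulk = new Cell("row1/cf:cq/ts1", 0L, "value2");
        if (stampBulkLoadedCells) {
            bulk = onRead(bulk, true, 10L);
        }
        return compareCells(bulk, flushed) < 0 ? bulk.value : flushed.value;
    }

    public static void main(String[] args) {
        System.out.println("without stamping: " + firstValue(false));
        System.out.println("with stamping:    " + firstValue(true));
    }
}
```

In this model, stamping lets the newer bulk loaded cell win at the cell-comparison step itself, so the result no longer depends on whether the file-level tie-break is reached.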
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075423#comment-14075423 ]

Andrew Purtell commented on HBASE-11591:
Thanks for the clarification!

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---
Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0
Attachments: TestBulkload.java
[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
[ https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074880#comment-14074880 ]

Andrew Purtell commented on HBASE-11591:
Making critical for .5. It seems to me we should be respecting the file level sequence in 0.98 as we did in 0.96, and not doing so is a bulk loading bug. Feel free to adjust priority downward if you disagree.

Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file
---
Key: HBASE-11591
URL: https://issues.apache.org/jira/browse/HBASE-11591
Project: HBase
Issue Type: Bug
Affects Versions: 0.99.0, 0.98.4
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Priority: Critical
Fix For: 0.99.0, 0.98.5
Attachments: TestBulkload.java