[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834297#comment-13834297 ]

Lars Hofhansl commented on HBASE-4433:

reseek was also dramatically improved with HBASE-9915 if a block encoder is used.

avoid extra next (potentially a seek) if done with column/row

Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
Fix For: 0.92.0

[Noticed this in 89, but quite likely true of trunk as well.]

When we are done with the requested column(s), the code still does an extra next() call before it realizes that it is actually done. This extra next() call could potentially result in an unnecessary extra block load. This is likely to be especially bad for CFs where the KVs are large blobs and each KV may occupy a block of its own, so the next() can often load a new, unrelated block unnecessarily.

For the simple case of reading, say, the top-most column in a row in a single file, where each column (KV) is, say, a block of its own, it seems that we are reading 3 blocks instead of 1 block!

I am working on a simple patch, and with that the number of seeks is down to 2.

[There is still an extra seek left. I think there were two levels of extra/unnecessary next() we were doing without actually confirming that the next was needed. One is at the StoreScanner/ScanQueryMatcher level, which this diff avoids. I think the other is at hfs.next() (at the storefile scanner level), which happens whenever an HFile scanner serves out data, and perhaps that's the additional seek that we need to avoid. But I want to tackle this optimization first as the two issues seem unrelated.]

The basic idea of the patch I am working on/testing is as follows. The ExplicitColumnTracker currently returns INCLUDE to the ScanQueryMatcher if the KV needs to be included and then, if done, only in the next call does it return the appropriate SEEK_NEXT_COL or SEEK_NEXT_ROW hint. For the cases when ExplicitColumnTracker knows it is done with a particular column/row, the patch attempts to combine the INCLUDE code and the done hint into a single match code: INCLUDE_AND_SEEK_NEXT_COL and INCLUDE_AND_SEEK_NEXT_ROW.

--
This message was sent by Atlassian JIRA (v6.1#6144)
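The combined match codes described in the issue text can be illustrated with a small model. This is only a sketch in Python, not HBase's Java code: the match code names come from the issue, while `check_column`, `scan_row`, and the assumption of one version per column and one KV per qualifier are hypothetical simplifications made for illustration.

```python
from enum import Enum, auto


class MatchCode(Enum):
    INCLUDE = auto()
    SEEK_NEXT_COL = auto()
    SEEK_NEXT_ROW = auto()
    INCLUDE_AND_SEEK_NEXT_COL = auto()
    INCLUDE_AND_SEEK_NEXT_ROW = auto()


INCLUDING = {
    MatchCode.INCLUDE,
    MatchCode.INCLUDE_AND_SEEK_NEXT_COL,
    MatchCode.INCLUDE_AND_SEEK_NEXT_ROW,
}
ROW_DONE = {MatchCode.SEEK_NEXT_ROW, MatchCode.INCLUDE_AND_SEEK_NEXT_ROW}


def check_column(qualifier, wanted, done, combined):
    """Hypothetical tracker step: decide what to do with one KV.

    wanted: requested qualifiers; done: set of qualifiers already returned.
    combined=False models the old behaviour (INCLUDE now, hint on the next
    call); combined=True models the patched behaviour (single match code).
    """
    remaining = [c for c in wanted if c not in done]
    if not remaining:
        # Old behaviour only notices "done" here, one KV too late.
        return MatchCode.SEEK_NEXT_ROW
    if qualifier not in remaining:
        return MatchCode.SEEK_NEXT_COL
    done.add(qualifier)
    if not combined:
        return MatchCode.INCLUDE
    if len(remaining) == 1:
        return MatchCode.INCLUDE_AND_SEEK_NEXT_ROW
    return MatchCode.INCLUDE_AND_SEEK_NEXT_COL


def scan_row(row, wanted, combined):
    """Walk the KVs of one row; return (included qualifiers, KVs examined)."""
    done, included, examined = set(), [], 0
    for qualifier in row:
        examined += 1  # each examined KV stands for one next(), maybe a block load
        code = check_column(qualifier, wanted, done, combined)
        if code in INCLUDING:
            included.append(qualifier)
        if code in ROW_DONE:
            break
    return included, examined
```

In this model, reading only `c1` from a row `["c1", "c2", "c3"]` examines two KVs with the old codes and one KV with the combined codes, which is the extra next() the issue is about.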
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593210#comment-13593210 ]

ramkrishna.s.vasudevan commented on HBASE-4433:

bq. b.t.w how to modify exist comment? Find no way to do it, while it seems some one could modify their comment.

You need admin access for that.

Your points above make sense. I was going through the code and hence had the doubt.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592938#comment-13592938 ]

Raymond Liu commented on HBASE-4433:

To figure out how much overhead the seek has, I read some more of the code. My table is major compacted, and in this situation the lazy seek approach doesn't seem to help, since only 1 scanner is involved. Each time, this scanner still goes through a lazy seek, is added to the heap, sorted, and polled out for a second, real seek. This introduces one extra lazy seek and the construction of a second fake key for the seek. The best path would be to seek directly, without the lazy seek, when only 1 StoreFileScanner is involved (or 1 StoreFileScanner + 1 memstore scanner?).

I tweaked the code a little to find out how much this affects the result. The scan time is reduced from 260s to 240s for include_and_seek, though that is still far from the 190s for include then seek, since there is still one seek involved, which is more expensive than a next. However, I find it hard to get things right if you want to switch from lazy seek to non-lazy seek later. I will try to read more code to find a solution.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593022#comment-13593022 ]

ramkrishna.s.vasudevan commented on HBASE-4433:

Nice findings, Liu. As Lars pointed out, we can work on improvements here: add some intelligence or some mathematics to figure out which path to take under which conditions.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593129#comment-13593129 ]

ramkrishna.s.vasudevan commented on HBASE-4433:

bq. The lazy seek approaching doesn't help. since there are only 1 scanner involved.

Can you explain this a bit more? Basically, lazy seek helps to reduce the number of HFiles that have to be seeked, right?
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593145#comment-13593145 ]

Raymond Liu commented on HBASE-4433:

Right, lazy seek tries to avoid seeking in old HFiles when possible. In my case, though, there is only 1 HFile because a major compaction has been done. Also, during a scan, a StoreFileScanner can be closed when it is done, so sooner or later only one StoreFileScanner will remain.

There are various other situations too. Say you need to scan all versions of the data: in this case, a lazy seek just pushes the real seek to later but does not reduce the number of real seeks. In both cases, lazy seek adds overhead.

Of course, when there are a lot of HFiles with different versions of rows and you just want to get the first version out, lazy seek will help.
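The lazy-seek trade-off discussed in the comment above can be sketched with a toy model. It is a deliberate simplification and not HBase's implementation: `count_real_seeks` and the `(max_ts, kv_ts)` tuples are hypothetical, the model only covers a request for the newest version of one column, and it counts real seeks only, ignoring the per-call cost of building fake keys and re-sorting the heap that the comment identifies as overhead.

```python
def count_real_seeks(files, lazy):
    """Count real seeks needed to find the newest version of one column.

    files: list of (max_ts, kv_ts) per HFile, where max_ts is the newest
    timestamp anywhere in the file and kv_ts is the timestamp of the file's
    newest KV for the requested column (None if the file has none).
    """
    if not lazy:
        # Eager seeking touches every file up front.
        return len(files)

    seeks = 0
    best = None  # newest real KV timestamp found so far
    # Lazy seeking orders scanners by a fake key built from each file's
    # max timestamp and only does a real seek on the scanner at the top.
    for max_ts, kv_ts in sorted(files, key=lambda f: f[0], reverse=True):
        if best is not None and best >= max_ts:
            break  # no remaining file can hold a newer version
        seeks += 1
        if kv_ts is not None and (best is None or kv_ts > best):
            best = kv_ts
    return seeks
```

With three files whose newest file already holds the newest version, the model needs one real seek instead of three. With a single file (the major-compacted case) lazy and eager both need exactly one real seek, so the lazy path can only add overhead there.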
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593147#comment-13593147 ]

Raymond Liu commented on HBASE-4433:

Also, when only one version of each row exists, no matter how many HFiles you have, a sequential scan will always need to scan all the HFiles row by row. You don't skip any real seek with lazy seek. And in many cases, like Hive on top of HBase or a bulk-loaded read-only table, I think it is quite normal for a row to have only one version.

b.t.w. how do I modify an existing comment? I can find no way to do it, while it seems some people can modify their comments.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590820#comment-13590820 ]

Lars Hofhansl commented on HBASE-4433:

Thanks Raymond. Seems like there's room for improvement in many scenarios. I'll also do some tests.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589333#comment-13589333 ]

Raymond Liu commented on HBASE-4433:

I have run another test. With the same 200G, 18-column table, I scan every other column.

With the include then seek approach, it will be:
c1 - next c2 - seek c3 - next c4 - seek c5 ...

With the include_and_seek approach, it will be:
c1 - seek c3 - seek c5 ...

That is, include then seek involves an extra next for each seek op, and this is the worst case for the include then seek approach. In my case, however, the two approaches don't show a noticeable performance difference: both take around 207s. For the previous best case (c1 - next c2 - next c3 vs. c1 - seek c2 - seek c3) it was 190s vs. 250s.

So, if the next() op does not involve extra block loading, I think this is acceptable. Extra block loading only happens when the next column is in the next block and fully occupies it. This could be rare: either the column is huge (in which case, should the default block size be adjusted?), or the history versions are huge (in which case it only happens when the current KV is the very last KV in the current block and the next block is entirely occupied by history versions).

Also, the wildcard column tracker now goes with the include and seek approach when max versions is reached.
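The two access patterns from Raymond Liu's test can be counted with a short sketch. `count_ops` is a hypothetical helper written for this illustration, not an HBase function. It assumes one KV per column, numbers columns from 1, and counts only the moves between the first and last requested column of a single row, exactly as in the traces in the comment.

```python
from bisect import bisect_right


def count_ops(wanted, combined):
    """Return (nexts, seeks) used to walk the requested columns of one row.

    wanted: iterable of 1-based column numbers that the scan requests.
    combined=False models "include then seek" (INCLUDE, then a next()).
    combined=True models "include_and_seek" (seek straight to the next
    requested column).
    """
    cols = sorted(wanted)
    wanted_set = set(cols)
    nexts = seeks = 0
    pos = cols[0]
    while pos < cols[-1]:
        if pos in wanted_set and not combined:
            nexts += 1  # INCLUDE, then step to the adjacent KV
            pos += 1
        else:
            seeks += 1  # jump to the next requested column
            pos = cols[bisect_right(cols, pos)]
    return nexts, seeks
```

For every other column of an 18-column row (`range(1, 19, 2)`), the model gives 8 nexts plus 8 seeks for include then seek and 8 seeks for include_and_seek. For all 18 columns it gives 17 nexts versus 17 seeks, which is the best case for include then seek that the comment reports as 190s vs. 250s.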
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589705#comment-13589705 ]

Kannan Muthukkaruppan commented on HBASE-4433:

Sorry for missing this thread. Will post a more detailed reply when I am at the computer.

In a later jira we fixed it such that a seek is really cheap if it is to a key within the same block. There is no need for a log(n) walk through the index if the key we are seeking to is in the same block.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589722#comment-13589722 ]

Lars Hofhansl commented on HBASE-4433:

Thanks Kannan. Looks like something we should get into the 0.94/0.95/trunk branches as well (assuming from Raymond's numbers that this change is only in the FB branch).
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589763#comment-13589763 ]
Liyin Tang commented on HBASE-4433:
Hi Lars, the JIRA Kannan mentioned is [HBASE-5987] HFileBlockIndex improvements. By looking ahead at the next indexed key, the HBase internal reader knows whether to keep scanning the current data block or to look up the index. This feature avoids additional index lookup overhead when multiple requests are sequentially scanning the HFile data block. Actually, we have a list of JIRAs in our FB internal HBase release. Do you know a proper place where we could share this work with more of hbase-dev?
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589767#comment-13589767 ]
Kannan Muthukkaruppan commented on HBASE-4433:
The relevant JIRA that addresses this issue is HBASE-5987. Basically, whenever we go down an index, we also look ahead and maintain the start key of the next block in the HFileScanner state. When a need to reseek to a key arises, we do a quick check to see if the key is in the same block (i.e. is less than the start key of the next block). If it is, the reseek doesn't need to consult the index again and can simply march along in the same block to find the key; else, it uses the index to find the block it needs to go to. Looks like this was fixed in 0.95. Raymond: Which version are you trying this with?
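The lookahead check described in Kannan's comment can be illustrated with a small sketch. This is not the HBASE-5987 code: the class name, the use of String keys in place of byte[] keys, and the sorted list standing in for the block index are all assumptions made for illustration. It only shows the idea that a forward reseek consults the index when the target key is not below the remembered start key of the next block.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the lookahead idea; not HBase's actual HFileScanner. */
final class LookaheadReseekSketch {
    /** Sorted start keys of the data blocks; stands in for the block index. */
    private final List<String> blockStartKeys;
    private int currentBlock;
    /** Start key of the block after the current one, or null for the last block. */
    private String nextBlockStartKey;
    private int indexLookups = 0;

    LookaheadReseekSketch(List<String> blockStartKeys) {
        this.blockStartKeys = blockStartKeys;
        position(0);
    }

    /** Moves to a block and remembers the start key of the block that follows it. */
    private void position(int block) {
        currentBlock = block;
        nextBlockStartKey =
            block + 1 < blockStartKeys.size() ? blockStartKeys.get(block + 1) : null;
    }

    /** Forward-only reseek: returns the block that would hold the given key. */
    int reseekTo(String key) {
        if (nextBlockStartKey == null || key.compareTo(nextBlockStartKey) < 0) {
            return currentBlock; // same block: keep marching along, no index lookup
        }
        indexLookups++; // the key lies in a later block, so consult the index
        int pos = Collections.binarySearch(blockStartKeys, key);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the wanted block
        // is the last one whose start key is <= key, i.e. insertionPoint - 1.
        position(pos >= 0 ? pos : -pos - 2);
        return currentBlock;
    }

    int indexLookups() {
        return indexLookups;
    }

    public static void main(String[] args) {
        LookaheadReseekSketch s =
            new LookaheadReseekSketch(Arrays.asList("a", "g", "m", "t"));
        for (String key : new String[] {"c", "f", "h", "k", "t"}) {
            System.out.println(key + " -> block " + s.reseekTo(key)
                + ", index lookups so far: " + s.indexLookups());
        }
    }
}
```

In this sketch, reseeks that stay inside the current block cost only a key comparison; the index is touched once per block transition.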
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589776#comment-13589776 ]
Ted Yu commented on HBASE-4433:
HBASE-5987 has been ported to 0.94 through HBASE-6032, meaning the improvement is in 0.94.3.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589784#comment-13589784 ]
Lars Hofhansl commented on HBASE-4433:
Thanks Liyin, Kannan, and Ted :) [~colorant] Which version of HBase did you use for your tests?
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590138#comment-13590138 ]
Raymond Liu commented on HBASE-4433:
Hi, I did this test on 0.94.1, but I had already ported HBASE-6032 onto it. Without this patch, the difference is even larger, so this is not about the index key issue. I think the overhead is that a fake key needs to be constructed for a seek operation, and the seek op itself is still slightly more expensive than a get op.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590194#comment-13590194 ]
Raymond Liu commented on HBASE-4433:
Anyway, to make sure no other issue might impact the result, I ran the same test again on 0.94.5, with a similar result.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588107#comment-13588107 ]
Raymond Liu commented on HBASE-4433:
I have an issue here related to this one. Consider a table that does not keep multiple versions in its rows, so each row has only a single version. There, a next operation reads in the next column's KeyValue and matches the next column without a seek operation. In this case, the next() operation actually saves time and improves performance. With a 200G table to scan in my test, next instead of seek is 30% faster, say 190s vs. 250s. So I think this behavior might need to be treated differently in different situations, since I think a read-only table with one version per row is also a very typical case, and there this patch actually makes the performance worse.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588122#comment-13588122 ]
ramkrishna.s.vasudevan commented on HBASE-4433:
Reading the description of the JIRA, I understand it was basically done for large blobs. Hence they tried to seek and then next() so that an unnecessary block seek does not happen. Your case is a plain case where you just need the next column. Any suggestions on how to go about this? Can we have some configuration?
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589068#comment-13589068 ]
Raymond Liu commented on HBASE-4433:
I am wondering whether we might add a conf to let the user choose the strategy: allow include_and_seek, or just use separate include/seek. However, the difference between these settings might not be easy for an end user to figure out, and whether the table has many history versions or not also depends entirely on how the table is used. It would be better to have some auto-select mechanism to help with it. If the table is mainly used in a write-once/read-many mode, only the user knows it; I don't know whether there is any way for HBase itself to find this out. If the table is configured with the max number of history versions set to 1, etc., then I guess in most cases it is safe for the column tracker to go with the separate include/seek approach.
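The configurable strategy Raymond proposes could look roughly like the sketch below. The match code names come from the issue description; everything else (the class, the method, and the boolean knob `combineIncludeAndSeek`) is hypothetical and only illustrates the choice between returning a combined code and returning a plain INCLUDE with the seek hint deferred to the next call.

```java
/** Hypothetical sketch of a strategy knob; not the actual ExplicitColumnTracker. */
final class ColumnDoneStrategySketch {
    /** Match codes named in the issue description (subset). */
    enum MatchCode {
        INCLUDE,
        INCLUDE_AND_SEEK_NEXT_COL,
        INCLUDE_AND_SEEK_NEXT_ROW
    }

    /**
     * Picks the code to return for a KV that must be included.
     *
     * @param doneWithColumn no further versions of this column are wanted
     * @param doneWithRow this was the last requested column of the row
     * @param combineIncludeAndSeek assumed config knob: true selects the combined
     *        codes, false keeps INCLUDE and leaves the seek hint to the next call
     */
    static MatchCode onIncludedKv(boolean doneWithColumn, boolean doneWithRow,
                                  boolean combineIncludeAndSeek) {
        if (!combineIncludeAndSeek) {
            return MatchCode.INCLUDE; // separate include/seek: costs one extra next()
        }
        if (doneWithRow) {
            return MatchCode.INCLUDE_AND_SEEK_NEXT_ROW;
        }
        if (doneWithColumn) {
            return MatchCode.INCLUDE_AND_SEEK_NEXT_COL;
        }
        return MatchCode.INCLUDE;
    }

    public static void main(String[] args) {
        System.out.println(onIncludedKv(true, false, true));
        System.out.println(onIncludedKv(true, false, false));
    }
}
```

With the knob off, a single-version table pays only a cheap next() per column; with it on, large-blob tables avoid loading a block they do not need.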
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589206#comment-13589206 ]
ramkrishna.s.vasudevan commented on HBASE-4433:
I agree with you, Raymond, on the part that the end user cannot figure it out. But having a config knob will at least help in understanding the behaviour of the application and then deciding on the nature of the include/seek mechanism. Having a knob will also at least save users from recompiling after making changes in the code. Just saying. But will there still be a chance of the following?
bq. When we are done with the requested column(s) the code still does an extra next() call before it realizes that it is actually done. This extra next() call could potentially result in an unnecessary extra block load
This may happen.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589221#comment-13589221 ]
Raymond Liu commented on HBASE-4433:
You are right, there is a chance that an extra next() will be called. For a large KV that occupies a single block, this might load an unnecessary extra block. In most cases, though, if the single KV is not that big, the next block always needs to be loaded even for seek_next_col; for seek_next_row it might not, if the row involves so many columns that it spans multiple blocks. And, unless the KV is extra big, for columns with multiple history versions this extra next might not cost much even when a seek is actually needed afterwards, because it saves part of the seek time: that KV has already been passed. Anyway, it will need a real case to verify the performance impact. And yes, I agree with you: if we can't tell which mechanism should be used, a configuration option is very useful.
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589224#comment-13589224 ]

Lars Hofhansl commented on HBASE-4433:

Interesting! This is almost impossible to get right automatically, I think. Even with MAX_VERSIONS=1 there might be a bunch of versions, in which case INCLUDE_AND_SEEK_* is better. We could use the size of the KV as a guidepost: if MAX_VERSIONS * size is larger than the HFile block size (64k by default) we could do INCLUDE_AND_SEEK, otherwise do INCLUDE followed by SEEK (if needed). (Just made this up, but we can probably use some heuristic like this.)

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
Fix For: 0.92.0
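The size-based heuristic floated in this comment could look roughly like the sketch below. It is only an illustration: `SeekHeuristicSketch` and `preferIncludeAndSeek` are hypothetical names, not HBase APIs, and the direction of the comparison (combined code when MAX_VERSIONS * size exceeds the block size) is an assumption read from the context of the comment.

```java
// Hypothetical helper illustrating the heuristic from the comment; not HBase code.
class SeekHeuristicSketch {
    // 64k HFile block size, the default mentioned in the comment.
    static final long DEFAULT_BLOCK_SIZE_BYTES = 64L * 1024L;

    // Assumption: when all retained versions of a column are expected to span
    // more than one block, prefer INCLUDE_AND_SEEK_*; otherwise return a plain
    // INCLUDE and issue the SEEK only if it turns out to be needed.
    static boolean preferIncludeAndSeek(int maxVersions, long kvSizeBytes, long blockSizeBytes) {
        return (long) maxVersions * kvSizeBytes > blockSizeBytes;
    }

    public static void main(String[] args) {
        // Small KVs, one version: stay with INCLUDE followed by SEEK if needed.
        System.out.println(preferIncludeAndSeek(1, 100, DEFAULT_BLOCK_SIZE_BYTES));       // false
        // Three versions of 32k each exceed one 64k block: combine the codes.
        System.out.println(preferIncludeAndSeek(3, 32 * 1024, DEFAULT_BLOCK_SIZE_BYTES)); // true
    }
}
```

As the comment itself notes, any such rule is a guess; whether it helps would have to be checked against real workloads.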
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115353#comment-13115353 ]

Hudson commented on HBASE-4433:

Integrated in HBase-TRUNK #2261 (See [https://builds.apache.org/job/HBase-TRUNK/2261/])
HBASE-4433 avoid extra next (potentially a seek) if done with column/row (kannan via jgray)

jgray :
Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
Fix For: 0.94.0
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115743#comment-13115743 ]

Andrew Purtell commented on HBASE-4433:

According to my tests, this is safe to do on 0.92 and 0.90 branches as well. This change should be applied there.

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
Fix For: 0.94.0
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115867#comment-13115867 ]

Jonathan Gray commented on HBASE-4433:

Is this not strictly an improvement/feature? It seems like it doesn't belong in stable branches :)

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
Fix For: 0.92.0
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115938#comment-13115938 ]

Hudson commented on HBASE-4433:

Integrated in HBase-0.92 #23 (See [https://builds.apache.org/job/HBase-0.92/23/])
HBASE-4433: avoid extra next (potentially a seek) if done with column/row
HBASE-4433: avoid extra next (potentially a seek) if done with column/row

stack :
Files :
* /hbase/branches/0.92/CHANGES.txt

stack :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
Fix For: 0.92.0
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115113#comment-13115113 ]

Kannan Muthukkaruppan commented on HBASE-4433:

ping. for code review. test suite ran clean.

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115191#comment-13115191 ]

Ted Yu commented on HBASE-4433:

+1 on patch. Nice work.

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114113#comment-13114113 ]

jirapos...@reviews.apache.org commented on HBASE-4433:

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2044/

Review request for Michael Stack, Jonathan Gray and Mikhail Bautin.

Summary
-------
Avoids extra next (potentially seek) calls when we are done with each column requested.

This addresses bug HBASE-4433.
https://issues.apache.org/jira/browse/HBASE-4433

Diffs
-----
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java 1175286
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 1175286
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 1175286
http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java 1175286
http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java 1175286
http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java 1175286

Diff: https://reviews.apache.org/r/2044/diff

Testing
-------
Ran TestBlocksRead/TestExplicitColumnTracker/TestQueryMatcher. Running the full suite now.

Thanks,
Kannan

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row
[ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108110#comment-13108110 ]

Jonathan Gray commented on HBASE-4433:

Good stuff. I think the first iteration of the ColumnTracker had the INCLUDE_AND_* primitives but it was simplified. Would be pretty cool to write up a unit test that creates single-KV-sized blocks, so you could run various queries and see the number of blocks accessed. Especially nice for catching regressions in the future.

avoid extra next (potentially a seek) if done with column/row
Key: HBASE-4433
URL: https://issues.apache.org/jira/browse/HBASE-4433
Project: HBase
Issue Type: Bug
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan