[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-11-27 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13834297#comment-13834297
 ] 

Lars Hofhansl commented on HBASE-4433:
--

reseek was also dramatically improved with HBASE-9915 if a block encoder is 
used.

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
 Fix For: 0.92.0


 [Noticed this in 89, but quite likely true of trunk as well.]
 When we are done with the requested column(s) the code still does an extra 
 next() call before it realizes that it is actually done. This extra next() 
 call could potentially result in an unnecessary extra block load. This is 
 likely to be especially bad for CFs where the KVs are large blobs where each 
 KV may be occupying a block of its own. So the next() can often load a new 
 unrelated block unnecessarily.
 --
 For the simple case of reading say the top-most column in a row in a single 
 file, where each column (KV) was say a block of its own-- it seems that we 
 are reading 3 blocks, instead of 1 block!
 I am working on a simple patch and with that the number of seeks is down to 
 2. 
 [There is still an extra seek left.  I think there were two levels of 
 extra/unnecessary next() we were doing without actually confirming that the 
 next was needed. One at the StoreScanner/ScanQueryMatcher level which this 
 diff avoids. I think the other is at hfs.next() (at the storefile scanner 
 level) that's happening whenever an HFile scanner serves out data-- and 
 perhaps that's the additional seek that we need to avoid. But I want to 
 tackle this optimization first as the two issues seem unrelated.]
 -- 
 The basic idea of the patch I am working on/testing is as follows. The 
 ExplicitColumnTracker currently returns INCLUDE to the ScanQueryMatcher if 
 the KV needs to be included and then if done, only in the next call it 
 returns the appropriate SEEK_NEXT_COL or SEEK_NEXT_ROW hint. For the cases 
 when ExplicitColumnTracker knows it is done with a particular column/row, the 
 patch attempts to combine the INCLUDE code and done hint into a single match 
 code-- INCLUDE_AND_SEEK_NEXT_COL and INCLUDE_AND_SEEK_NEXT_ROW.
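
The combined match codes described above can be illustrated with a small toy
model. This is only a sketch, not the HBase implementation: the class
MatchCodeSketch, its method names, and the way versions are counted are
assumptions made for illustration. Only the match code names come from the
description. It assumes maxVersions is at least 1 and that the column has at
least that many versions.

```java
/** Toy model of the match codes discussed above; not HBase's real classes. */
final class MatchCodeSketch {
    enum MatchCode {
        INCLUDE, SEEK_NEXT_COL, SEEK_NEXT_ROW,
        INCLUDE_AND_SEEK_NEXT_COL, INCLUDE_AND_SEEK_NEXT_ROW
    }

    /** Behaviour before the patch: the "done" hint only arrives on the call after the last INCLUDE. */
    static MatchCode separate(int alreadyIncluded, int maxVersions, boolean lastColumn) {
        if (alreadyIncluded < maxVersions) {
            return MatchCode.INCLUDE;
        }
        return lastColumn ? MatchCode.SEEK_NEXT_ROW : MatchCode.SEEK_NEXT_COL;
    }

    /** Patched behaviour: when the current KV completes the column, fold the hint into the INCLUDE. */
    static MatchCode combined(int alreadyIncluded, int maxVersions, boolean lastColumn) {
        if (alreadyIncluded + 1 < maxVersions) {
            return MatchCode.INCLUDE;
        }
        return lastColumn
                ? MatchCode.INCLUDE_AND_SEEK_NEXT_ROW
                : MatchCode.INCLUDE_AND_SEEK_NEXT_COL;
    }

    /** Counts how many KVs the matcher has to look at before it knows the column is finished. */
    static int kvsExamined(boolean useCombined, int maxVersions, boolean lastColumn) {
        int included = 0;
        int examined = 0;
        while (true) {
            examined++; // each examined KV stands for one next() on the scanner
            MatchCode code = useCombined
                    ? combined(included, maxVersions, lastColumn)
                    : separate(included, maxVersions, lastColumn);
            if (code != MatchCode.INCLUDE) {
                return examined;
            }
            included++;
        }
    }

    public static void main(String[] args) {
        System.out.println("separate hint: " + kvsExamined(false, 1, true) + " KVs examined");
        System.out.println("combined code: " + kvsExamined(true, 1, true) + " KVs examined");
    }
}
```

In this model the separate-hint variant always examines one KV more than it
includes, and that extra KV is the next() that may load an unrelated block.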



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-03-05 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593210#comment-13593210
 ] 

ramkrishna.s.vasudevan commented on HBASE-4433:
---

bq. b.t.w. how to modify an existing comment? I can find no way to do it, 
while it seems some people can modify their comments.
You need admin access for that.  
Your points above make sense.  I was going through the code and hence had the 
doubt.  



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-03-04 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592938#comment-13592938
 ] 

Raymond Liu commented on HBASE-4433:


To figure out how much overhead the seek has, I read some more of the code. My 
table is major compacted, and in that situation the lazy seek approach doesn't 
seem to help, since only 1 scanner is involved. Each time, this scanner still 
goes through a lazy seek, is added to the heap, sorted, and polled out for a 
second, real seek. This introduces one extra lazy seek and the construction of 
a second fake key for the seek. The best path would be to seek directly, 
without the lazy seek, when only 1 storefilescanner is involved (or 
1 storefilescanner + 1 memstorescanner?). I tweaked the code a little to find 
out how much this affects the result. The scan time dropped from 260s to 240s 
for include_and_seek, though that is still far from the 190s for include then 
seek, since there is still one seek involved, which is more expensive than a 
next.

However, I find it hard to get things right if you want to switch from lazy 
seek to non-lazy seek later. I will try to read more code to find a solution.



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-03-04 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593022#comment-13593022
 ] 

ramkrishna.s.vasudevan commented on HBASE-4433:
---

Nice findings, Liu.  As Lars pointed out, we can work on improvements here: add 
some intelligence or some mathematics to figure out which path to take under 
which conditions.



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-03-04 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593129#comment-13593129
 ] 

ramkrishna.s.vasudevan commented on HBASE-4433:
---

bq. The lazy seek approach doesn't help, since only 1 scanner is 
involved.
Can you elaborate on this?  Basically, lazy seek helps to reduce the number of 
HFiles that need to be seeked, right?



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-03-04 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593145#comment-13593145
 ] 

Raymond Liu commented on HBASE-4433:


Right, lazy seek tries to avoid seeking in old HFiles when possible. In my 
case, though, there is only 1 HFile because a major compaction has been done. 
Also, during a scan a storeFileScanner can be closed once it is done, so sooner 
or later only one storeFileScanner will remain.

There are various other situations. Say you need to scan all versions of the 
data: in this case a lazy seek just pushes the real seek to later, but does not 
reduce the number of real seeks.

In both cases, lazy seek adds overhead.

Of course, when there are a lot of HFiles with different versions of rows and 
you just want to get the first version out of them, lazy seek will help.
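
A rough way to see when lazy seek saves real seeks is the simplified model
below. It is only a sketch under stated assumptions and is not how HBase
implements lazy seek: each file is reduced to its newest timestamp (maxTs) and
the timestamp of the requested cell in that file (cellTs, or -1 if absent),
timestamps are non-negative, and only the newest version is wanted. The class
and method names are hypothetical.

```java
import java.util.Arrays;

/** Simplified model of lazy seek for a "newest version only" read; not HBase code. */
final class LazySeekSketch {

    /**
     * Files are visited from newest to oldest by maxTs. A file is only really
     * seeked if it could still hold a version at least as new as the best one found so far.
     */
    static int realSeeksForNewest(long[] maxTs, long[] cellTs) {
        Integer[] order = new Integer[maxTs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(maxTs[b], maxTs[a]));

        long best = -1; // newest timestamp of the cell found so far
        int realSeeks = 0;
        for (int file : order) {
            if (best > maxTs[file]) {
                break; // this file and all older ones cannot beat the version already found
            }
            realSeeks++;
            best = Math.max(best, cellTs[file]);
        }
        return realSeeks;
    }

    /** Without lazy seek, every file is really seeked. */
    static int realSeeksEager(long[] maxTs) {
        return maxTs.length;
    }

    public static void main(String[] args) {
        long[] maxTs = {300, 200, 100};
        long[] cellTs = {250, 150, 50};
        System.out.println("lazy:  " + realSeeksForNewest(maxTs, cellTs));
        System.out.println("eager: " + realSeeksEager(maxTs));
    }
}
```

With a single file both counts are 1, so in this model lazy seek can only add
bookkeeping there. The saving appears only when several files overlap and the
newest one already holds the wanted version.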




[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-03-04 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593147#comment-13593147
 ] 

Raymond Liu commented on HBASE-4433:


Also, when only one version of a row exists, then no matter how many HFiles you 
have, a sequential scan will always need to scan all the HFiles row by row. You 
don't skip any real seek by using lazy seek. And in many cases, like Hive on 
top of HBase or a bulk-loaded read-only table, I think it's quite normal for a 
row to have only one version.

b.t.w. how do I modify an existing comment? I can find no way to do it, while 
it seems some people can modify their comments.



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-03-01 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590820#comment-13590820
 ] 

Lars Hofhansl commented on HBASE-4433:
--

Thanks Raymond. Seems like there's room for improvement in many scenarios. I'll 
also do some tests.



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589333#comment-13589333
 ] 

Raymond Liu commented on HBASE-4433:


I have run another test. With the same 200G, 18-column table, I scan every 
other column.
With the include then seek approach, it will be c1 - next c2 - seek c3 - 
next c4 - seek c5 ...
With the include_and_seek approach, it will be c1 - seek c3 - seek c5 ...

That is, an extra next is involved for each seek op, which is the worst case 
for the include then seek approach. Yet in my case these two approaches show no 
noticeable performance difference: both take around 207s. For the previous best 
case (c1 - next c2 - next c3 vs. c1 - seek c2 - seek c3), it was 190s vs. 
250s.

So, if the next() op does not involve extra block loading, I think this is 
acceptable.
Extra block loading only happens when the next col is in the next block and 
fully occupies it. This could be rare: either the col is huge (in which case 
the default block size should be adjusted?), or the history versions are huge 
(in which case it only happens when the current kv is the very last kv in the 
current block and the next block is entirely occupied by history versions).

Also, the wildcolumntracker now goes with the include and seek approach when 
max versions is reached.
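
The two access patterns compared above can be counted with a small sketch. It
only tallies next and seek operations inside one row and says nothing about
their cost. The class ScanOpCount and its methods are hypothetical names, and
the model assumes every column index exists in the row and that the initial
positioning on the first wanted column is not counted.

```java
import java.util.TreeSet;

/** Counts next/seek operations within one row for the two approaches; a toy model only. */
final class ScanOpCount {

    /** Returns {nexts, seeks} needed to visit all wanted columns of one row. */
    static int[] count(TreeSet<Integer> wanted, boolean combined) {
        int nexts = 0;
        int seeks = 0;
        Integer pos = wanted.isEmpty() ? null : wanted.first();
        while (pos != null) {
            Integer target = wanted.higher(pos);
            if (target == null) {
                break; // last wanted column; the move to the next row is not modelled
            }
            if (combined) {
                seeks++; // include_and_seek: jump straight to the next wanted column
            } else {
                nexts++; // include then seek: always step to the adjacent column first
                if (target != pos + 1) {
                    seeks++; // the adjacent column was not wanted, so seek onward
                }
            }
            pos = target;
        }
        return new int[] {nexts, seeks};
    }

    /** Wanted columns 1, 1 + step, 1 + 2 * step, ... up to the given column count. */
    static TreeSet<Integer> everyNth(int columns, int step) {
        TreeSet<Integer> wanted = new TreeSet<>();
        for (int c = 1; c <= columns; c += step) {
            wanted.add(c);
        }
        return wanted;
    }

    public static void main(String[] args) {
        int[] separate = count(everyNth(18, 2), false);
        int[] combined = count(everyNth(18, 2), true);
        System.out.println("include then seek: nexts=" + separate[0] + " seeks=" + separate[1]);
        System.out.println("include_and_seek:  nexts=" + combined[0] + " seeks=" + combined[1]);
    }
}
```

For every other column both approaches issue the same number of seeks and
differ only in the extra nexts. For adjacent columns the first approach uses
only nexts while the second uses only seeks.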



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Kannan Muthukkaruppan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589705#comment-13589705
 ] 

Kannan Muthukkaruppan commented on HBASE-4433:
--

Sorry for missing this thread. Will post a more detailed reply when I am at
the computer. In a later jira we fixed it such that seek is really cheap if
it is to a key within the same block. No need for a log(n) walk through the
index if the key we are seeking to is in the same block.
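
The same-block shortcut can be sketched as follows. This is a toy model and not
the HFile reader: keys are plain longs instead of byte arrays, the block index
is just an array holding the first key of each block, and the class
SameBlockSeek and its members are hypothetical names.

```java
import java.util.Arrays;

/** Toy model of skipping the block index when the seek target is in the current block. */
final class SameBlockSeek {
    private final long[] blockFirstKeys; // block index: first key of each block, ascending
    private int currentBlock = 0;
    int indexLookups = 0; // how often the index had to be searched

    SameBlockSeek(long[] blockFirstKeys) {
        this.blockFirstKeys = blockFirstKeys.clone();
    }

    /** Positions on the block that would contain the key and returns its number. */
    int reseek(long key) {
        boolean onLastBlock = currentBlock == blockFirstKeys.length - 1;
        long nextIndexedKey = onLastBlock ? Long.MAX_VALUE : blockFirstKeys[currentBlock + 1];
        if (key >= blockFirstKeys[currentBlock] && key < nextIndexedKey) {
            return currentBlock; // target is in the current block: no index walk needed
        }
        indexLookups++;
        int i = Arrays.binarySearch(blockFirstKeys, key); // the log(n) walk
        // A negative result encodes the insertion point; the containing block is the one before it.
        currentBlock = i >= 0 ? i : Math.max(0, -i - 2);
        return currentBlock;
    }

    public static void main(String[] args) {
        SameBlockSeek scanner = new SameBlockSeek(new long[] {0, 100, 200});
        scanner.reseek(50);
        scanner.reseek(150);
        scanner.reseek(160);
        System.out.println("index lookups: " + scanner.indexLookups);
    }
}
```

Only the seek that crosses a block boundary touches the index here. Seeks that
stay inside the current block are answered from the remembered boundary key.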






[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589722#comment-13589722
 ] 

Lars Hofhansl commented on HBASE-4433:
--

Thanks Kannan. Looks like something we should get into the 0.94/0.95/trunk 
branches as well (assuming from Raymond's numbers that this change is only in 
the FB branch).




[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589763#comment-13589763
 ] 

Liyin Tang commented on HBASE-4433:
---

Hi Lars, the jira Kannan mentioned is [HBASE-5987] HFileBlockIndex 
improvements. By looking ahead at the next indexed key, the HBase internal 
reader knows whether to keep scanning the current DataBlock or look up the 
index. This feature avoids additional index lookup overhead when multiple 
requests are sequentially scanning the HFile data block.

Actually, we have a list of jiras in our FB internal HBase release. Do you know 
a proper place where we could share this work with more of hbase-dev?



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Kannan Muthukkaruppan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589767#comment-13589767
 ] 

Kannan Muthukkaruppan commented on HBASE-4433:
--

The relevant JIRA that addresses this issue is: HBASE-5987.

Basically, whenever we go down an index, we also look ahead and maintain the 
start key of the next block in the HFileScanner state. When a need to reseek to 
a key arises, we do a quick check to see if the key is in the same block (i.e. 
is less than the start key of the next block). If it is, the reseek doesn't 
need to consult the index again and can simply march along in the same block to 
find the key; else, it uses the index to find the block it needs to go to.

Looks like this was fixed in 0.95. Raymond: Which version are you trying this 
with?
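
The lookahead check described in this comment could look roughly like the sketch below. It assumes keys are plain strings and the block index is a sorted array of block start keys; the class ReseekSketch and all of its members are hypothetical and do not mirror the real HFileScanner code from HBASE-5987.

```java
import java.util.Arrays;

// Hypothetical model of a scanner that remembers the start key of the next
// block so that a forward reseek can often skip the index lookup.
final class ReseekSketch {
    private final String[] blockStartKeys; // sorted first key of every block
    private int currentBlock;              // block the scanner is positioned in
    private String nextBlockStartKey;      // null when the current block is the last one
    int indexLookups = 0;                  // counts how often the index was consulted

    ReseekSketch(String[] blockStartKeys) {
        this.blockStartKeys = blockStartKeys;
        loadBlock(0);
    }

    private void loadBlock(int i) {
        currentBlock = i;
        // Look ahead: remember where the next block starts.
        nextBlockStartKey = (i + 1 < blockStartKeys.length) ? blockStartKeys[i + 1] : null;
    }

    // Forward-only reseek; returns the index of the block that may hold the key.
    int reseek(String key) {
        boolean sameBlock = nextBlockStartKey == null
                || key.compareTo(nextBlockStartKey) < 0;
        if (!sameBlock) {
            indexLookups++;
            int pos = Arrays.binarySearch(blockStartKeys, key);
            // A negative result encodes the insertion point; the key belongs to
            // the block that starts just before that point.
            loadBlock(pos >= 0 ? pos : -pos - 2);
        }
        return currentBlock; // the caller then scans forward inside this block
    }
}
```

In this sketch the index is only consulted when the target key is at or beyond the remembered start key of the next block, so a run of reseeks that stay within one block costs no index lookups.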
---



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589776#comment-13589776
 ] 

Ted Yu commented on HBASE-4433:
---

HBASE-5987 has been ported to 0.94 through HBASE-6032
Meaning, the improvement is in 0.94.3



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589784#comment-13589784
 ] 

Lars Hofhansl commented on HBASE-4433:
--

Thanks Liyin, Kannan, and Ted :)

[~colorant] Which version of HBase did you use for your tests?



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590138#comment-13590138
 ] 

Raymond Liu commented on HBASE-4433:


Hi, I did this test on 0.94.1, but I had already ported HBASE-6032 onto it. 
Without this patch, the difference is even larger.

So this is not an index key issue.
I think the overhead is that a fake key needs to be constructed for a seek 
operation. And the seek op itself is still slightly more expensive than a get op.




[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-28 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590194#comment-13590194
 ] 

Raymond Liu commented on HBASE-4433:


Anyway, to make sure no other issue was affecting the result, I ran the same 
test again on 0.94.5, with a similar result.



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-27 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588107#comment-13588107
 ] 

Raymond Liu commented on HBASE-4433:


I have an issue here related to this one. Take a table that does not keep 
multiple versions per row: each row has only a single version. There, a next 
operation will read in the next column's KeyValue and match the next column 
without a seek operation. In this case, the next() operation actually saves 
time and improves performance. With a 200G table to scan in my test, 
next instead of seek is about 30% faster, say 190s vs. 250s.

So I think this behavior might need to be treated differently for different 
situations, since I think this one-version-per-row, read-only table is also a 
very typical case. And this patch actually makes the performance worse.



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-27 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588122#comment-13588122
 ] 

ramkrishna.s.vasudevan commented on HBASE-4433:
---

Reading the description of the JIRA, I understand it was basically done for 
large blobs.  Hence they tried to seek and then next() so that an unnecessary 
block seek does not happen.
So yours is a plain case where you just need the next column.

Any suggestions on how to go about this?  Can we have some configuration?



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-27 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589068#comment-13589068
 ] 

Raymond Liu commented on HBASE-4433:


I am wondering whether we might add a conf to let the user choose the strategy: 
include_and_seek, or separate include/seek. However, the difference between 
these settings might not be easy for an end user to figure out. And whether 
the table has many history versions or not also depends entirely on the usage 
of the table. It would be better to have some auto-select mechanism to help 
with it. 

If the table is mainly used in a write-once/read-many mode, only the user knows 
it; I don't know whether there is any way for hbase itself to find this out.

While if the table is configured with max history VERSIONS set to 1 etc., then 
in most cases I guess it is safe for the column tracker to go with the separate 
include/seek approach.
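
As a rough illustration of the knob plus auto-select idea, here is a minimal sketch. The setting values "seek", "next" and "auto", the enum AdvanceStrategy and the class AdvanceStrategyChooser are all made up for this example; no such HBase configuration is implied to exist.

```java
// Hypothetical strategies: combined include-and-seek hint, or plain include
// followed by a separate next().
enum AdvanceStrategy { INCLUDE_AND_SEEK, INCLUDE_THEN_NEXT }

final class AdvanceStrategyChooser {
    // configured: hypothetical setting, one of "seek", "next", or "auto".
    // maxVersions: the max versions configured for the column family.
    static AdvanceStrategy choose(String configured, int maxVersions) {
        switch (configured) {
            case "seek":
                return AdvanceStrategy.INCLUDE_AND_SEEK;   // user forces the combined hint
            case "next":
                return AdvanceStrategy.INCLUDE_THEN_NEXT;  // user forces the plain next()
            default:
                // "auto": with a single version per cell, the following KV is
                // likely the next column already, so a plain next() is assumed
                // to be cheaper than a seek.
                return maxVersions <= 1
                        ? AdvanceStrategy.INCLUDE_THEN_NEXT
                        : AdvanceStrategy.INCLUDE_AND_SEEK;
        }
    }
}
```

The auto branch only encodes the guess made in this comment; whether it actually helps would still need to be measured on real tables.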



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-27 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589206#comment-13589206
 ] 

ramkrishna.s.vasudevan commented on HBASE-4433:
---

I agree with you, Raymond, on the part that the end user cannot figure it out.  
But having a config knob will at least help in understanding the behaviour of 
the application and then deciding on the nature of the include/seek mechanism.  
Also, having a knob will at least spare users from recompiling after making 
changes in the code.  Just saying. 
But will there still be a chance of the following?
bq.When we are done with the requested column(s) the code still does an extra 
next() call before it realizes that it is actually done. This extra next() call 
could potentially result in an unnecessary extra block load
This may happen. 



[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-27 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589221#comment-13589221
 ] 

Raymond Liu commented on HBASE-4433:


You are right, there is a chance that an extra next() will be called. For a 
large KV that occupies a whole block, this might load an unnecessary 
extra block. In most cases, though, when a single KV is not that big, the 
next block needs to be loaded anyway, even for seek_next_col; seek_next_row might 
not need it if the row has so many columns that it spans multiple blocks.

And when the KV is not extra large, for columns with multiple historical versions, 
this extra next might not cost much even if a seek is actually needed afterwards, 
because it saves part of the seek time: that KV has already been passed. Anyway, 
it will take real cases to verify the performance impact.

And yes, I agree with you: if we can't tell which mechanism should be used, a 
configuration option would be very useful.

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
 Fix For: 0.92.0




[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2013-02-27 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589224#comment-13589224
 ] 

Lars Hofhansl commented on HBASE-4433:
--

Interesting! This is almost impossible to get right automatically, I think. Even 
with MAX_VERSIONS=1 there might be a bunch of versions, where INCLUDE_AND_SEEK_* 
is better.

Could use the size of the KV as a guidepost. If MAX_VERSIONS * size > the 
HFile blocksize (64k by default) we could do INCLUDE_AND_SEEK, otherwise do INCLUDE 
followed by SEEK (if needed).

(Just made this up, but we can probably use some heuristic like this)
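
As a rough illustration of the heuristic above, here is a minimal Java sketch. 
It takes the comparison to be "greater than", which is my reading of the 
comment, and all names in it (SeekHeuristicSketch, preferIncludeAndSeek, 
DEFAULT_BLOCK_SIZE_BYTES) are hypothetical; none of this is HBase API.

```java
public class SeekHeuristicSketch {
    // 64k, the default HFile block size mentioned in the comment above.
    static final long DEFAULT_BLOCK_SIZE_BYTES = 64L * 1024L;

    /**
     * Hypothetical heuristic: prefer the combined INCLUDE_AND_SEEK code when the
     * versions we expect to return for a column would not fit in a single block.
     *
     * @param maxVersions    configured MAX_VERSIONS for the scan
     * @param kvSizeBytes    size of the current KV in bytes
     * @param blockSizeBytes HFile block size in bytes
     */
    static boolean preferIncludeAndSeek(int maxVersions, long kvSizeBytes, long blockSizeBytes) {
        // Widen to long before multiplying so large KVs cannot overflow an int.
        return (long) maxVersions * kvSizeBytes > blockSizeBytes;
    }

    public static void main(String[] args) {
        // A 100 KB blob with one version versus three 100-byte versions.
        System.out.println(preferIncludeAndSeek(1, 100_000L, DEFAULT_BLOCK_SIZE_BYTES));
        System.out.println(preferIncludeAndSeek(3, 100L, DEFAULT_BLOCK_SIZE_BYTES));
    }
}
```

When the check is false, the caller would fall back to a plain INCLUDE and 
decide about the SEEK on the following KV, as the comment suggests.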

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
 Fix For: 0.92.0




[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-27 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115353#comment-13115353
 ] 

Hudson commented on HBASE-4433:
---

Integrated in HBase-TRUNK #2261 (See 
[https://builds.apache.org/job/HBase-TRUNK/2261/])
HBASE-4433  avoid extra next (potentially a seek) if done with column/row 
(kannan via jgray)

jgray : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java


 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
 Fix For: 0.94.0






[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-27 Thread Andrew Purtell (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115743#comment-13115743
 ] 

Andrew Purtell commented on HBASE-4433:
---

According to my tests, this is safe to do on 0.92 and 0.90 branches as well. 
This change should be applied there.

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
 Fix For: 0.94.0






[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-27 Thread Jonathan Gray (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115867#comment-13115867
 ] 

Jonathan Gray commented on HBASE-4433:
--

Is this not strictly an improvement/feature?  It seems like it doesn't belong 
in stable branches :)

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
 Fix For: 0.92.0






[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-27 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115938#comment-13115938
 ] 

Hudson commented on HBASE-4433:
---

Integrated in HBase-0.92 #23 (See 
[https://builds.apache.org/job/HBase-0.92/23/])
HBASE-4433: avoid extra next (potentially a seek) if done with column/row
HBASE-4433: avoid extra next (potentially a seek) if done with column/row

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java


 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
 Fix For: 0.92.0






[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-26 Thread Kannan Muthukkaruppan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115113#comment-13115113
 ] 

Kannan Muthukkaruppan commented on HBASE-4433:
--

ping. for code review.

test suite ran clean.

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan





[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-26 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115191#comment-13115191
 ] 

Ted Yu commented on HBASE-4433:
---

+1 on patch.
Nice work.

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan





[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-24 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114113#comment-13114113
 ] 

jirapos...@reviews.apache.org commented on HBASE-4433:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2044/
---

Review request for Michael Stack, Jonathan Gray and Mikhail Bautin.


Summary
---

Avoids extra next (potentially seek) calls when we are done with each column 
requested.


This addresses bug HBASE-4433.
https://issues.apache.org/jira/browse/HBASE-4433


Diffs
-

  
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ExplicitColumnTracker.java
 1175286 
  
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
 1175286 
  
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
 1175286 
  
http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksRead.java
 1175286 
  
http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestExplicitColumnTracker.java
 1175286 
  
http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java
 1175286 

Diff: https://reviews.apache.org/r/2044/diff


Testing
---

Ran TestBlocksRead/TestExplicitColumnTracker/TestQueryMatcher. Running the full 
suite now.


Thanks,

Kannan



 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Improvement
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan





[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

2011-09-19 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13108110#comment-13108110
 ] 

Jonathan Gray commented on HBASE-4433:
--

Good stuff.  I think the first iteration of the ColumnTracker had the 
INCLUDE_AND_* primitives but it was simplified.  It would be pretty cool to 
write up a unit test that creates single-KV-sized blocks, so you could run 
various queries to see the number of blocks accessed.  It would be especially 
nice for catching regressions in the future.
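
To make the test idea concrete, here is a toy Java model of it. It is only a 
sketch under strong assumptions: it does not touch HBase, every name in it 
(BlockReadCountSketch, ToyFile, blocksReadForFirstKv) is hypothetical, and it 
models a single scanner layer where each KV sits alone in its own block.

```java
import java.util.HashSet;
import java.util.Set;

public class BlockReadCountSketch {
    /** Toy "file": the KV at index i lives alone in block i. */
    static final class ToyFile {
        private final int kvCount;
        private final Set<Integer> loadedBlocks = new HashSet<>();

        ToyFile(int kvCount) {
            this.kvCount = kvCount;
        }

        /** Loads the block holding KV i; returns false when i is past the end of the file. */
        boolean load(int i) {
            if (i < 0 || i >= kvCount) {
                return false;
            }
            loadedBlocks.add(i);
            return true;
        }

        int blocksLoaded() {
            return loadedBlocks.size();
        }
    }

    /**
     * Reads only the first KV of the file and reports how many blocks were touched.
     * With combinedHint == false the reader advances once more before it notices
     * that it is done, which is the extra next() the issue describes.
     */
    static int blocksReadForFirstKv(int kvCount, boolean combinedHint) {
        ToyFile file = new ToyFile(kvCount);
        file.load(0); // the KV we actually want
        if (!combinedHint) {
            file.load(1); // extra next(): touches the neighbouring, unrelated block
        }
        return file.blocksLoaded();
    }

    public static void main(String[] args) {
        System.out.println("separate hint: " + blocksReadForFirstKv(3, false));
        System.out.println("combined hint: " + blocksReadForFirstKv(3, true));
    }
}
```

A real regression test would assert on the block-read count of an actual 
region scan; the toy version only shows the shape of such an assertion.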

 avoid extra next (potentially a seek) if done with column/row
 -

 Key: HBASE-4433
 URL: https://issues.apache.org/jira/browse/HBASE-4433
 Project: HBase
  Issue Type: Bug
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
