[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-25 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216370#comment-13216370
 ] 

Max Lapan commented on HBASE-5416:
--

@all
Thanks for a discussion, I'll benchmark 2-phase approach, maybe it's a solution 
indeed.
One thing still is not clear for me: how the batching factor could improve gets 
performance? The get request is synchronous, isn't it? So, in mapper, I issue 
get to obtain value from large column, and wait for it to be ready. In fact, 
single get won't be significantly havier than seek in scanner, but batching 
seems no help there. In fact, could be wrong there, didn't experiment much with 
that.


> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, 
> Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-24 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215771#comment-13215771
 ] 

Max Lapan commented on HBASE-5416:
--

@Nicolas:
Still, have no idea how to resolve our slow scans problem different way. 
Two-phase rpc would be very inefficient in map-reduce job, when we need to 
issue lots of gets for each obtained 'flag' row and and have no good place to 
save them for multi-get (which could be huge in some cases). Batching also have 
little help there, because slowness not caused by a large Results, but tons of 
useless work, performed by a regionserver on such scans. Or, maybe, I missed 
something?

I agree that this solution is not elegant and complicates scan machinery, but 
all other approaches looks worse.

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, 
> Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-24 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215550#comment-13215550
 ] 

Max Lapan commented on HBASE-5416:
--

@stack:
Documentation paragraph to include. I think it should go there: 
http://hbase.apache.org/book.html#number.of.cfs
{quote}
There is a performance option to keep in mind on schema design. In some situations, two (or more) columns family schema could be much faster than a single-CF design. It could be the case when you have one column which is used to sieve larger rows from other columns. If SingleColumnValueFilter or SingleColumnValueExcludeFilter is used to find the needed rows, only a small column is scanned, other columns are  loaded only when matching row has been found. This could reduce the amount of data loaded significantly and lead to faster scans.
{quote}

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered_scans.patch, Filtered_scans_v2.patch, 
> Filtered_scans_v3.patch, Filtered_scans_v4.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-24 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215529#comment-13215529
 ] 

Max Lapan commented on HBASE-5416:
--

@Zhihong: Have trouble with post new review request - it gives 500 error. Maybe 
this is related with apache jira issues, will try later.

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered_scans.patch, Filtered_scans_v2.patch, 
> Filtered_scans_v3.patch, Filtered_scans_v4.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-24 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215504#comment-13215504
 ] 

Max Lapan commented on HBASE-5416:
--

@Thomas:

Yes, this is the primary goal of this patch. When CF_B is large, we'll load 
only needed blocks from it (via seek), which could give a huge speedup in scan.

@Zhihong:

Thanks, I'll fix this, now waiting to jenkins results.
Didn't know about reviews.apache.org, thanks. I'll post there, of couse :).

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered_scans.patch, Filtered_scans_v2.patch, 
> Filtered_scans_v3.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-22 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213468#comment-13213468
 ] 

Max Lapan commented on HBASE-5416:
--

The problem was a little bit tricky than I expected.

The failed tests are caused by PageFilter and WhileMatchFilter expecting that 
filterRow method are called only once per non-empty row. Previous version of 
patch breaks this, so, tests are failed. I resolved this by checking that row 
is not empty right before filterRow(List) called, but this requires to slightly 
modify SingleColumnValueExcludeFilter logic - move exclude phase from 
filterKeyValue method to filterRow(List). The main reason for this is beacuse 
there is no way to distinguish at RegionScanner::nextInternal level empty row 
which is empty because of filter accepts row, but excludes all it's KVs and row 
which is empty due to filter rejects it.

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered_scans.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-21 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212565#comment-13212565
 ] 

Max Lapan commented on HBASE-5416:
--

Thanks for the instruction. It caused by internal filter state broken by my 
patch (filterRow called in wrong time). Working on that.

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered_scans.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-20 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212327#comment-13212327
 ] 

Max Lapan commented on HBASE-5416:
--

Quite probable, have plans to find this out today.

{quote}
Would appreciate some better doc in the filters package doc or over in the 
manual to accompany this change.
{quote}

I have a question about this. "Manual" == hbase book? And what 'filters package 
doc' is? :) Is it comments in source processed by javadoc, or somethinc else? 
Sorry for these questions - have no java experience ;).

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered_scans.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210041#comment-13210041
 ] 

Max Lapan commented on HBASE-5416:
--

@Nicolas: Hm, dont't understand how to implement 2-phase approach in map-reduce 
job, without extra complications. Also, haven't found 
Scan.setMaxResultsPerColumnFamily in current hbase source, only your patch for 
0.89.

Atomicity is not critical to us in this case - only performance and usage 
simplicity. 

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered-scans_0.90.4.patch, Filtered-scans_trunk.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209782#comment-13209782
 ] 

Max Lapan commented on HBASE-5416:
--

@stack ok

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered-scans_0.90.4.patch, Filtered-scans_trunk.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209778#comment-13209778
 ] 

Max Lapan commented on HBASE-5416:
--

In our case, multi-gets is not a solution, because in our schema we have much 
more than 2 CFs (it was only the example). We have 7 CFs, and different scans 
are using different sets of CFs. Also, we have no prior knowlege about how many 
rows will be accepted by filter, so, it could be too many gets.

Bloom filters also don't help much, because they filters whole files, not 
blocks as seek() does.


> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered-scans_0.90.4.patch, Filtered-scans_trunk.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

2012-02-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209764#comment-13209764
 ] 

Max Lapan commented on HBASE-5416:
--

Empty result is checked later, after values loaded from joinedHeap scanners. If 
check before load, we can lost rows when filter is 
SingleColumnValueExcludeFilter.

> Improve performance of scans with some kind of filters.
> ---
>
> Key: HBASE-5416
> URL: https://issues.apache.org/jira/browse/HBASE-5416
> Project: HBase
>  Issue Type: Improvement
>  Components: filters, performance, regionserver
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
> Attachments: Filtered-scans_0.90.4.patch
>
>
> When the scan is performed, whole row is loaded into result list, after that 
> filter (if exists) is applied to detect that row is needed.
> But when scan is performed on several CFs and filter checks only data from 
> the subset of these CFs, data from CFs, not checked by a filter is not needed 
> on a filter stage. Only when we decided to include current row. And in such 
> case we can significantly reduce amount of IO performed by a scan, by loading 
> only values, actually checked by a filter.
> For example, we have two CFs: flags and snap. Flags is quite small (bunch of 
> megabytes) and is used to filter large entries from snap. Snap is very large 
> (10s of GB) and it is quite costly to scan it. If we needed only rows with 
> some flag specified, we use SingleColumnValueFilter to limit result to only 
> small subset of region. But current implementation is loading both CFs to 
> perform scan, when only small subset is needed.
> Attached patch adds one routine to Filter interface to allow filter to 
> specify which CF is needed to it's operation. In HRegion, we separate all 
> scanners into two groups: needed for filter and the rest (joined). When new 
> row is considered, only needed data is loaded, filter applied, and only if 
> filter accepts the row, rest of data is loaded. At our data, this speeds up 
> such kind of scans 30-50 times. Also, this gives us the way to better 
> normalize the data into separate columns by optimizing the scans performed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4799) Catalog Janitor logic bug causes region leackage

2011-11-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151792#comment-13151792
 ] 

Max Lapan commented on HBASE-4799:
--

Yes, this comment is obsolete now.

Combine splits removal in one go is also a good point. IIRC, 
MetaEditor::deleteDaughterReferenceInParent used only by janitor, so this could 
be changed easily.

I'll do this and update patch.

> Catalog Janitor logic bug causes region leackage
> 
>
> Key: HBASE-4799
> URL: https://issues.apache.org/jira/browse/HBASE-4799
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
>Priority: Critical
> Fix For: 0.92.0, 0.90.5
>
> Attachments: 0001-Fix-of-Regions-Leaks-problem-in-janitor.patch, 
> 0002-Temporary-fix-to-remove-leaked-regions.patch
>
>
> When region split takes a significant amount of time, CatalogJanitor can 
> cleanup one of SPLIT records, but left another in META. When another split 
> finish, janitor cleans left SPLIT record, but parent regions haven't removed 
> from FS and META not cleared.
> The race condition is follows:
> 1. region split started
> 2. one of regions splitted, i.e. A (have no reference storefiles) but other 
> (B) doesn't
> 3. janitor started and in routine checkDaughter removes SPLITA from meta, but 
> see that SPLITB has references and does nothing.
> 4. region B completes split
> 5. janitor wakes up, removes SPLITB, but see that there is no records for A 
> and does nothing again.
> Result - parent region hangs forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4799) Catalog Janitor logic bug causes region leackage

2011-11-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151370#comment-13151370
 ] 

Max Lapan commented on HBASE-4799:
--

I tried, didn't help. 4238 is a different issue, appeared when regions splitted 
twice in a row.

> Catalog Janitor logic bug causes region leackage
> 
>
> Key: HBASE-4799
> URL: https://issues.apache.org/jira/browse/HBASE-4799
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
>Priority: Critical
> Attachments: 0001-Fix-of-Regions-Leaks-problem-in-janitor.patch, 
> 0002-Temporary-fix-to-remove-leaked-regions.patch
>
>
> When region split takes a significant amount of time, CatalogJanitor can 
> cleanup one of SPLIT records, but left another in META. When another split 
> finish, janitor cleans left SPLIT record, but parent regions haven't removed 
> from FS and META not cleared.
> The race condition is follows:
> 1. region split started
> 2. one of regions splitted, i.e. A (have no reference storefiles) but other 
> (B) doesn't
> 3. janitor started and in routine checkDaughter removes SPLITA from meta, but 
> see that SPLITB has references and does nothing.
> 4. region B completes split
> 5. janitor wakes up, removes SPLITB, but see that there is no records for A 
> and does nothing again.
> Result - parent region hangs forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4799) Catalog Janitor logic bug causes region leackage

2011-11-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151362#comment-13151362
 ] 

Max Lapan commented on HBASE-4799:
--

This bug differs from HBASE-4238. 4238 is about regions splitting too fast, but 
4799 about regions splitting too slow.

> Catalog Janitor logic bug causes region leackage
> 
>
> Key: HBASE-4799
> URL: https://issues.apache.org/jira/browse/HBASE-4799
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
>Priority: Critical
> Attachments: 0001-Fix-of-Regions-Leaks-problem-in-janitor.patch, 
> 0002-Temporary-fix-to-remove-leaked-regions.patch
>
>
> When region split takes a significant amount of time, CatalogJanitor can 
> cleanup one of SPLIT records, but left another in META. When another split 
> finish, janitor cleans left SPLIT record, but parent regions haven't removed 
> from FS and META not cleared.
> The race condition is follows:
> 1. region split started
> 2. one of regions splitted, i.e. A (have no reference storefiles) but other 
> (B) doesn't
> 3. janitor started and in routine checkDaughter removes SPLITA from meta, but 
> see that SPLITB has references and does nothing.
> 4. region B completes split
> 5. janitor wakes up, removes SPLITB, but see that there is no records for A 
> and does nothing again.
> Result - parent region hangs forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4799) Catalog Janitor logic bug causes region leackage

2011-11-16 Thread Max Lapan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151356#comment-13151356
 ] 

Max Lapan commented on HBASE-4799:
--

Tried to attach files in reverse order, but form sorts them alphabetically, so 
not sure :).

We had this problem right after upgrade from 0.90.3-cdh3u1 to 0.90.4-cdh3u2 a 
month ago. I tested both patches today - they work. The main reason I separated 
them, is that 'temporary fix' is not needed in long term - it just tells 
janitor to remove leacked regions (we had about 15Tb of garbage on 50Tb table). 
And it could be dangerous in some situations, for example, when region started 
to split and RS crashed. So, only 0001 should be commited.

> Catalog Janitor logic bug causes region leackage
> 
>
> Key: HBASE-4799
> URL: https://issues.apache.org/jira/browse/HBASE-4799
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.4
>Reporter: Max Lapan
>Assignee: Max Lapan
>Priority: Critical
> Attachments: 0001-Fix-of-Regions-Leaks-problem-in-janitor.patch, 
> 0002-Temporary-fix-to-remove-leaked-regions.patch
>
>
> When region split takes a significant amount of time, CatalogJanitor can 
> cleanup one of SPLIT records, but left another in META. When another split 
> finish, janitor cleans left SPLIT record, but parent regions haven't removed 
> from FS and META not cleared.
> The race condition is follows:
> 1. region split started
> 2. one of regions splitted, i.e. A (have no reference storefiles) but other 
> (B) doesn't
> 3. janitor started and in routine checkDaughter removes SPLITA from meta, but 
> see that SPLITB has references and does nothing.
> 4. region B completes split
> 5. janitor wakes up, removes SPLITB, but see that there is no records for A 
> and does nothing again.
> Result - parent region hangs forever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira