[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13625623#comment-13625623 ] Sergey Shelukhin commented on HBASE-5416: - what is this patch for in this JIRA? It's been closed months ago Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.94.5, 0.95.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-TestJoinedScanners-0.94.txt, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585131#comment-13585131 ] Ted Yu commented on HBASE-5416: --- I am with Lars on this one. The feature should be part of 0.94 bq. Ted, I think your approach will just make things more complicated going forward Another option is to drop the new method from Filter interface. Server side implementation depends on FilterBase which has the stub isFamilyEssential(). HRegion.RegionScannerImpl can use instanceof check which is fast. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585148#comment-13585148 ] Dave Latham commented on HBASE-5416: {quote} How hard is it to change filter to use FilterBase and replace it first? {quote} The change is very simple for us. It means we need to a wait a bit before deploying the hbase upgrade until we can upgrade our client apps first, though. This is what we've decided to do, so this incompatibility is not going to be a blocker for us, just a slight delay. {quote} I'd be interested in why you had to implement Filter directly rather than extending FilterBase. {quote} This particular Filter implementation was made as a wrapper around any other Filter as part of some experiments we were doing for more dynamic Filter classloading a couple years back. I don't think there was a FilterBase class at the time or we may have just chose to make it a generic Filter (or actually RowFilterInterface back then) to make sure it implements and wraps every method. I think leaving the method in FilterBase only for 0.94 would be a good move. However, it's a bit tricky since 0.94.5 has already been released. If the method is dropped from Filter in 0.94.6 then we're saying 0.94.6 is compatible with everything but 0.94.5. However if you were unfortunate enough to start on 0.94.5 and implement Filter directly then you're going to break again. Perhaps that's a rare enough case. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585151#comment-13585151 ] Ted Yu commented on HBASE-5416: --- @Dave: 0.94.5 was announced on 2013-02-16. If 0.94.6 is released within 10 days, the window of someone implementing 0.94.5 version of Filter interface is very short. In hindsight, we should have implemented this feature in 0.94 without touching Filter interface. We have a good lesson (for other interfaces). If you want to deploy 0.94.5 in the next few days, try not adding @Override in your Filter implementation. Again, thanks for reporting this - other HBase users would get benefit. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585167#comment-13585167 ] Ted Yu commented on HBASE-5416: --- testScanner_JoinedScanners passed as well: {code} Running org.apache.hadoop.hbase.regionserver.TestHRegion 2013-02-23 09:07:06.614 java[57714:1203] Unable to load realm info from SCDynamicStore Tests run: 73, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 44.586 sec {code} Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585211#comment-13585211 ] Lars Hofhansl commented on HBASE-5416: -- Moving the method to FilterBase is a great idea and a good compromise. Personally I think implementing Filters directly is rare and I would have a preference for keeping this in the interface since it is cleaner. isFamilyEssential is very useful for future scan performance enhancements, I would hate to see it vanish again from the Filter interface. (Incidentally in trunk Filter is now a class, which would have allowed us to make changes without this problem). As alternative can we add to the Javadoc of Filter a note to avoid implementing it directly and rather extend FilterBase? [~davelatham] If this is a hassle for you I think we're all in agreement that we should push the method down to FilterBase. [~stack] I think you'd prefer the push into FilterBase. Let's just do that. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585230#comment-13585230 ] Dave Latham commented on HBASE-5416: It's not going to make a difference for me any longer as we're planning to move forward with an application update then a 0.94.5 upgrade. However, it sounds like a good plan to move to FilterBase (in the 0.94 branch only) to preserve compatibility for anyone else who comes along. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585233#comment-13585233 ] stack commented on HBASE-5416: -- bq. It's not going to make a difference for me any longer as we're planning to move forward with an application update then a 0.94.5 upgrade. So, you don't need us back anything out? bq. What do you think of my proposal above (@23/Feb/13 06:11) ? Can't find what you are referring to [~ted_yu]. If i search I only see the above pointer. bq. Dave Latham If this is a hassle for you I think we're all in agreement that we should push the method down to FilterBase. For 0.94? If it means a better compatibility story (with a hiccup, i.e. we warn folks about prob. in 0.90.5 but its fixed in 0.90.6), then I'm for it. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585237#comment-13585237 ] Ted Yu commented on HBASE-5416: --- bq. If i search I only see the above pointer. I was referring to approach #1 which Lars said is too complicated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585242#comment-13585242 ] Ted Yu commented on HBASE-5416: --- Created HBASE-7920 to move the new method out of Filter interface. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585246#comment-13585246 ] Dave Latham commented on HBASE-5416: {quote}So, you don't need us back anything out?{quote} That's right, we're just going to work around it. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585281#comment-13585281 ] Lars Hofhansl commented on HBASE-5416: -- It feels a bit like an overreaction. Not many folks implement their own filters, of those not many implement Filter directly, there are workarounds, and for Dave it no longer matters. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584553#comment-13584553 ] Dave Latham commented on HBASE-5416: I have a class that directly implements the Filter interface. This change looks to me like it will prevent me from doing a rolling upgrade to 0.94.5 of region servers while my client is using this filter on scans because the filter will fail to implement the changed interface. Is that correct? Is that acceptable? Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584971#comment-13584971 ] Sergey Shelukhin commented on HBASE-5416: - Hmm, this is correct. I am not sure if this is acceptable, iirc I saw someone pondering that (on the mailing list?) but deciding that most people would use FilterBase, but I cannot find it now. How hard is it to change filter to use FilterBase and replace it first? [~lhofhansl] Do you have an opinion? Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585018#comment-13585018 ] stack commented on HBASE-5416: -- [~lhofhansl] Would suggest backing out this change if it breaks compatibility, especially if it breaks compatibility for our homies in SOMA, SF. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585021#comment-13585021 ] stack commented on HBASE-5416: -- [~lhofhansl] Thanks for adding it to trunk. [~shmuma] Any chance of a paragraph on your fancy new feature? If you draft it -- including the possibilities your patch enables -- I'll take care of getting it into the ref guide. Good stuff. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585041#comment-13585041 ] Ted Yu commented on HBASE-5416: --- The 0.94 patch did introduce subtle issue. But this feature is useful. See email thread entitled 'Co-Processor in scanning the HBase's Table' on mailing list. The cause seems to be the addition of a new method to Filter interface. Can we do the following ? 1. introduce new interface, say Filter2 (open to other names), where isFamilyEssential(byte[] name) is added 2. move isFamilyEssential(byte[] name) out of Filter interface 3. let FilterBase implement Filter2 4. declare filter field of RegionScannerImpl to be of type Filter2 Since 0.94.5 has been rolled out, it is another kind of regression if this feature is taken out. My two cents. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585048#comment-13585048 ] Lars Hofhansl commented on HBASE-5416: -- IMHO this is similar to the coprocessor changes we had made in some 0.94 point releases that also break coprocessors (unless they derive from classes like BaseRegionObserver). In fact our own Phoenix folks ran into issues with this. These are somewhat internal APIs and we should be able to change them... Although I admit Filters are more stable in terms of APIs than coprocessors. Still, I'd vote for keep this patch, unchanged. [~davelatham], I'd be interested in why you had to implement Filter directly rather than extending FilterBase. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585055#comment-13585055 ] stack commented on HBASE-5416: -- There is no argument that this a 'useful' feature. 'useful' is not good enough reason to break 'public' Interface. Why would we put any obstacle in the way of the group that is running the largest hbase deploy? Don't they have enough headache already w/o having to jump a gratuitous incompatibility hurdle Anyone even 'need' this feature in 0.94? Suggest removing it for 0.90.6 so our man Dave can just go there. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585058#comment-13585058 ] Ted Yu commented on HBASE-5416: --- [~davelatham], [~stack]: What do you think of my proposal above (@23/Feb/13 06:11) ? Would that allow Dave to get over the hurdle ? If you think so, I can open a new JIRA with a patch. Thanks Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585062#comment-13585062 ] Lars Hofhansl commented on HBASE-5416: -- A rolling upgrade is still possible if a stub isFamilyEssential(...) is added to the Filter implementation before the rolling upgrade. Anyway, I am not attached to this feature in 0.94. At the same time I do not want to cripple our ability to make some changes to these APIs. We have not frozen the coprocessor APIs and neither should we freeze the Filter APIs. What if somebody had implemented a coprocessor API that we had changed. In the past we have stated that we will change these (coprocessor) APIs. Ted, I think your approach will just make things more complicated going forward. And I'd prefer to either keep this or revert altogether. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585064#comment-13585064 ] Lars Hofhansl commented on HBASE-5416: -- One last comment :) The reason why I am arguing keeping this is that this is one of the few features that allows HBase to make use to of its columnar nature to speed up queries. HBase is not known for its scan performance and this is one features to point to where we allow HBase to not even look at another column family unless a filter is matched for potentially significant speedups. I was planning on extending this to other filters as well. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570945#comment-13570945 ] Hudson commented on HBASE-5416: --- Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/]) HBASE-5416 Improve performance of scans with some kind of filters. (Sergey Shelukhin) (Revision 1433195) Result = FAILURE larsh : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/KeyValue.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/Scan.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/Filter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/FilterBase.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/FilterList.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueExcludeFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SkipFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/WhileMatchFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/filter/TestSingleColumnValueExcludeFilter.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553712#comment-13553712 ] Hudson commented on HBASE-5416: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #348 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/348/]) HBASE-7383 create integration test for HBASE-5416 (improving scan performance for certain filters) (Sergey) (Revision 1433224) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/LoadTestDataGenerator.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/LoadTestKVGenerator.java * /hbase/trunk/hbase-common/src/test/java/org/apache/hadoop/hbase/util/TestLoadTestKVGenerator.java * /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/IntegrationTestLazyCfLoading.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/encoding/TestEncodedSeekers.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/LoadTestTool.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedAction.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedReader.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedWriter.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/RestartMetaTest.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13554223#comment-13554223 ] Hudson commented on HBASE-5416: --- Integrated in HBase-0.94-security #95 (See [https://builds.apache.org/job/HBase-0.94-security/95/]) HBASE-5416 Improve performance of scans with some kind of filters. (Sergey Shelukhin) (Revision 1433195) Result = SUCCESS larsh : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/KeyValue.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/Scan.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/Filter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/FilterBase.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/FilterList.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueExcludeFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SkipFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/WhileMatchFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/filter/TestSingleColumnValueExcludeFilter.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553167#comment-13553167 ] Sergey Shelukhin commented on HBASE-5416: - Can this be counted as +1 to commit? :) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553221#comment-13553221 ] Lars Hofhansl commented on HBASE-5416: -- Yes. I will commit this today. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553329#comment-13553329 ] Hudson commented on HBASE-5416: --- Integrated in HBase-0.94 #732 (See [https://builds.apache.org/job/HBase-0.94/732/]) HBASE-5416 Improve performance of scans with some kind of filters. (Sergey Shelukhin) (Revision 1433195) Result = SUCCESS larsh : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/KeyValue.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/Scan.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/Filter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/FilterBase.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/FilterList.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueExcludeFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/SkipFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/filter/WhileMatchFilter.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/filter/TestSingleColumnValueExcludeFilter.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553406#comment-13553406 ] Hudson commented on HBASE-5416: --- Integrated in HBase-TRUNK #3745 (See [https://builds.apache.org/job/HBase-TRUNK/3745/]) HBASE-7383 create integration test for HBASE-5416 (improving scan performance for certain filters) (Sergey) (Revision 1433224) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/LoadTestDataGenerator.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/LoadTestKVGenerator.java * /hbase/trunk/hbase-common/src/test/java/org/apache/hadoop/hbase/util/TestLoadTestKVGenerator.java * /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/IntegrationTestLazyCfLoading.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/encoding/TestEncodedSeekers.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/LoadTestTool.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedAction.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedReader.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/MultiThreadedWriter.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/RestartMetaTest.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551652#comment-13551652 ] Lars Hofhansl commented on HBASE-5416: -- I think I convinced myself that this is good to go for 0.94. Going forward this could be useful for all kinds of filters. I can see many scenarios where we want filters to be evaluated on selected CFs only and include the other CFs when the row is not filtered based on the former. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549955#comment-13549955 ] Sergey Shelukhin commented on HBASE-5416: - Is this JIRA unresolved pending 0.94 commit? Just checking as it shows up in my filter :) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13550431#comment-13550431 ] Ted Yu commented on HBASE-5416: --- For 0.94 patch, I saw the following on my Mac: {code} testScanner_JoinedScannersWithLimits(org.apache.hadoop.hbase.regionserver.TestHRegion) Time elapsed: 0.001 sec FAILURE! junit.framework.AssertionFailedError: expected:3 but was:1 at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at junit.framework.Assert.assertEquals(Assert.java:199) at junit.framework.Assert.assertEquals(Assert.java:205) at org.apache.hadoop.hbase.regionserver.TestHRegion.testScanner_JoinedScannersWithLimits(TestHRegion.java:2976) {code} Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13550680#comment-13550680 ] Lars Hofhansl commented on HBASE-5416: -- I ran all 0.94 tests. They all pass on my machines. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13550731#comment-13550731 ] Lars Hofhansl commented on HBASE-5416: -- Does this test fail consistently for you? Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13550737#comment-13550737 ] Lars Hofhansl commented on HBASE-5416: -- On my machine at this test fails too in 0.94. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13550772#comment-13550772 ] Lars Hofhansl commented on HBASE-5416: -- Found the problem. For the part of the patch that I had applied manually I mistook{{kv != KV_LIMIT}} for {{kv == KV_LIMIT}}. No idea how on earth the test on my server machine passed. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13550794#comment-13550794 ] Ted Yu commented on HBASE-5416: --- Thanks Lars for the finding. I am running test suite based on patch v3. Will report back if there is any abnormality. I was looking for long lines. {code} + public static final String LOAD_CFS_ON_DEMAND_CONFIG_KEY = hbase.hregion.scan.loadColumnFamiliesOnDemand; {code} nit: wrap long line above. {code} + * @param heap KeyValueHeap to fetch data from. It must be positioned on correct row before call. {code} Long line: it would be 100 characters wide if the trailing period is removed. {code} + stopRow = nextKv == null || isStopRow(nextKv.getBuffer(), nextKv.getRowOffset(), nextKv.getRowLength()); {code} Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13550820#comment-13550820 ] Ted Yu commented on HBASE-5416: --- The following tests failed locally: TestMultiSlaveReplication,TestMasterReplication,TestZKLeaderManager The first two failed without the patch, too. I think patch v3 should be good to go. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0, 0.94.5 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548688#comment-13548688 ] Karthik Ranganathan commented on HBASE-5416: I think the specific description (of making filters apply to only some CF's) is a good idea.But we continue down this path of generalizing filters, it could lead to an explosion of ad-hoc filters. In that case, it might be better to expose more co-processor hooks. Overall, +1 (only skimmed the changes though). Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548713#comment-13548713 ] Ted Yu commented on HBASE-5416: --- Thanks for the review, Karthik. I will think about how co-processor hooks can be used to reduce changes in filters. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549006#comment-13549006 ] Lars Hofhansl commented on HBASE-5416: -- I don't necessarily think that ad hoc filters are bad. They are nice in that they are per store, can do skip scans, etc. They fill a different use case compare to coprocs. If anything, this might be an impetus to support filters better (load them dynamically like coprocs, maybe even invent a general filter descriptions, etc, etc). Since nobody has better API ideas I'm +1 on committing (both 0.94 and 0.96). Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549012#comment-13549012 ] Ted Yu commented on HBASE-5416: --- We have 4 +1's for this JIRA. It is time to integrate. I plan to do that in trunk by this evening. @Lars: I haven't run test suite for 0.94 patch, do you want to integrate to 0.94 branch ? Thanks Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549040#comment-13549040 ] Ted Yu commented on HBASE-5416: --- Integrated to trunk. Thanks for the patch, Max and Sergey. Thanks for the review, Stack, Lars, Ram and Karthik. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549091#comment-13549091 ] Hudson commented on HBASE-5416: --- Integrated in HBase-TRUNK #3716 (See [https://builds.apache.org/job/HBase-TRUNK/3716/]) HBASE-5416 Improve performance of scans with some kind of filters (Max Lapan and Sergey) (Revision 1431103) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/KeyValue.java * /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java * /hbase/trunk/hbase-protocol/src/main/protobuf/Client.proto * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/Scan.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/Filter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/FilterBase.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/FilterList.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/FilterWrapper.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueExcludeFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/SkipFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/WhileMatchFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/filter/TestSingleColumnValueExcludeFilter.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestJoinedScanners.java Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549208#comment-13549208 ] Lars Hofhansl commented on HBASE-5416: -- Looks like the trunk patch was not changed since I made the 0.94 patch (except for wrapping the long line, etc). I'll commit in the next day or so (unless somebody objects) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549238#comment-13549238 ] Hudson commented on HBASE-5416: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #338 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/338/]) HBASE-5416 Improve performance of scans with some kind of filters (Max Lapan and Sergey) (Revision 1431103) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/KeyValue.java * /hbase/trunk/hbase-protocol/src/main/java/org/apache/hadoop/hbase/protobuf/generated/ClientProtos.java * /hbase/trunk/hbase-protocol/src/main/protobuf/Client.proto * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/Scan.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/Filter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/FilterBase.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/FilterList.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/FilterWrapper.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueExcludeFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/SkipFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/filter/WhileMatchFilter.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/protobuf/ProtobufUtil.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/filter/TestSingleColumnValueExcludeFilter.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestJoinedScanners.java Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547076#comment-13547076 ] Ted Yu commented on HBASE-5416: --- [~mikhail], [~karthik.ranga], [~kannanm]: Your opinion would be helpful. Thanks Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13546430#comment-13546430 ] Sergey Shelukhin commented on HBASE-5416: - Btw, the test appears to pass. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544708#comment-13544708 ] Ted Yu commented on HBASE-5416: --- Thanks for the review, Ram. +1 from me too. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544746#comment-13544746 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563428/5416-v16.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestSplitTransaction Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3876//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544755#comment-13544755 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- +1 from me. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544773#comment-13544773 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563434/5416-v16.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplicationWithCompression Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3877//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data,
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545182#comment-13545182 ] Ted Yu commented on HBASE-5416: --- I plan to integrate patch v16 Monday (the 7th) afternoon. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545200#comment-13545200 ] Lars Hofhansl commented on HBASE-5416: -- Can we rule out first that there are no more general approaches? For example: What about declaring the columns that will have filters applied to them in the scan object? Maybe there are more way to look at this. I would also feel better if some of the Facebooks folks took a look at this ([~mikhail], [~karthik.ranga], [~kannanm]). Generally I like the patch, and I think we should even have this in 0.94. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544619#comment-13544619 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- @Ted Went thro the patch.. Wrote some tests to see that things work fine. Anyway did not do any perf tests. Looks good to me. Thanks for doing this. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543138#comment-13543138 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- I can check on this tomorrow with the latest patch and the doubts i had earlier. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541415#comment-13541415 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562796/5416-v16.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: {color:red}-1 core zombie tests{color}. There are 4 zombie test(s): at org.apache.hadoop.hbase.io.encoding.TestUpgradeFromHFileV1ToEncoding.testUpgrade(TestUpgradeFromHFileV1ToEncoding.java:83) Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3783//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541420#comment-13541420 ] Ted Yu commented on HBASE-5416: --- From https://builds.apache.org/job/PreCommit-HBASE-Build/3783//console: {code} pool-1-thread-1 prio=10 tid=0x773bc800 nid=0x451 in Object.wait() [0x772fe000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on lt;0x803da7e8 (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread) at java.lang.Thread.join(Thread.java:1186) - locked lt;0x803da7e8 (a org.apache.hadoop.hbase.util.JVMClusterUtil$RegionServerThread) at java.lang.Thread.join(Thread.java:1239) at org.apache.hadoop.hbase.util.JVMClusterUtil.shutdown(JVMClusterUtil.java:245) at org.apache.hadoop.hbase.LocalHBaseCluster.shutdown(LocalHBaseCluster.java:430) at org.apache.hadoop.hbase.MiniHBaseCluster.shutdown(MiniHBaseCluster.java:501) at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniHBaseCluster(HBaseTestingUtility.java:856) at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniCluster(HBaseTestingUtility.java:826) at org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.after(TestSplitTransactionOnCluster.java:109) {code} Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541422#comment-13541422 ] Ted Yu commented on HBASE-5416: --- Local run of tests flagged by QA script was successful: 172 mt -Dtest=TestRestartCluster,TestSplitTransactionOnCluster 173 mt -Dtest=TestUpgradeFromHFileV1ToEncoding Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541166#comment-13541166 ] Lars Hofhansl commented on HBASE-5416: -- I will try to make a 0.94 patch and do some performance testing, if that turns out well, let's pull this into 0.94 as well. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541172#comment-13541172 ] Lars Hofhansl commented on HBASE-5416: -- v13 is undoing some of the optimization I had put it. HRegion.RegionScannerImpl.isStopRow should continue to take a byte[], offset, and length, rather than a KeyValue. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541174#comment-13541174 ] Lars Hofhansl commented on HBASE-5416: -- Even when I fix that, it does slow down the tight loop case by about 10%. (That is a scan on a single column with a filter that happens to filter everything, and everything is in the blockcache) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541175#comment-13541175 ] Lars Hofhansl commented on HBASE-5416: -- This is probably due to the extra calls to KeyValueHeap.peek(). Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541178#comment-13541178 ] Lars Hofhansl commented on HBASE-5416: -- If the while loop in populateResults is change back to the original do/while loop the results are closer to what it was before. I understand why you changed it to the while loop (it looks better), but it does unnecessary peeks into the heap. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541179#comment-13541179 ] Lars Hofhansl commented on HBASE-5416: -- It's still 5.8s vs 6.2s (without vs with patch). Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541185#comment-13541185 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562737/5416-0.94-v1.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3772//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541191#comment-13541191 ] Lars Hofhansl commented on HBASE-5416: -- Here's another experiment I did against the 0.94 patch: Two column families, a SingleColumnValueFilter that filters everything based on the 1st CF. When I run this against 2 CFs (with on demand disabled) it takes about 60s. When I run this a single CF it takes ~30s. When I enable on demand loading and run against both CFs it still takes 60s. I would have expected that that would have been closer to 30s. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541200#comment-13541200 ] Ted Yu commented on HBASE-5416: --- Thanks for doing the experiments, Lars. Can you take a look at the use case shown in the test in the patch to see if it matches the way you did experiment ? Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541202#comment-13541202 ] Lars Hofhansl commented on HBASE-5416: -- There's something amiss here. Filer.filterRow is not called anywhere in trunk except for some filterwrappers and in tests (so we could just remove it, or there's a more general bug in trunk). It seems it is not used. SCVF does not implement filterRow(ListKeyValue) so this entire thing cannot work (I did these checks on the trunk patch). I assumed what I tested was the use case: You have multiple column families and use a SCVF based on one of the families, which should then avoid even touching the other CFs. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541237#comment-13541237 ] Lars Hofhansl commented on HBASE-5416: -- Thanks Ted. You're right. I'll redo my test in a bit. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541241#comment-13541241 ] Lars Hofhansl commented on HBASE-5416: -- Actually. That should make it work with my 0.94 patch. In 0.96 Filter.filterRow() is never called, so this it would never skip any rows that way. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541250#comment-13541250 ] Ted Yu commented on HBASE-5416: --- @Lars: Pardon me for not explaining in more detail. In HRegion.RegionScannerImpl(): {code} if (scan.hasFilter()) { this.filter = new FilterWrapper(scan.getFilter()); {code} In FilterWrapper: {code} public void filterRow(ListKeyValue kvs) { //To fix HBASE-6429, //Filter with filterRow() returning true is incompatible with scan with limit //1. hasFilterRow() returns true, if either filterRow() or filterRow(kvs) is implemented. //2. filterRow() is merged with filterRow(kvs), //so that to make all those row related filtering stuff in the same function. this.filter.filterRow(kvs); if (!kvs.isEmpty() this.filter.filterRow()) { kvs.clear(); } {code} If you have further questions, please let me know. Once you confirm that the 0.94 patch works, I will attach patch v14 for trunk which addresses the two issues you raised this afternoon. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541257#comment-13541257 ] Lars Hofhansl commented on HBASE-5416: -- Cool... You're right here too. We should probably merge these two in 0.94 as well, but that's another story. (It's all getting a bit messy. We have Filter, FilterBase, FilterWrapper. Now we even have logic in FilterWrapper) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541258#comment-13541258 ] Lars Hofhansl commented on HBASE-5416: -- I confirmed that when I set filterIfMissing the scan time is very close to 30s. Note that I am not giving a +1, yet. This needs a bit more looking at. Especially the slow down the regular path is a bit disconcerting. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541280#comment-13541280 ] Lars Hofhansl commented on HBASE-5416: -- It might help to inline the populateResults code again. That would save us an extra call to KeyValueHeap.peek(). Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541281#comment-13541281 ] Lars Hofhansl commented on HBASE-5416: -- Or populateResult could return the nextKv to indicate that we're not done (null to indicate that a limit was hit) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541286#comment-13541286 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562752/5416-v14.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication {color:red}-1 core zombie tests{color}. There are 2 zombie test(s): Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3778//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541291#comment-13541291 ] Lars Hofhansl commented on HBASE-5416: -- Lemme do the same to the 0.94 patch and try my test again. (If we want this in 0.94 we have to maintain these two patches separately now, as 0.94 and 0.96 are sufficiently different in this area. Maybe 0.96 is indeed better.) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541306#comment-13541306 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562766/5416-0.94-v2.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3781//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13541307#comment-13541307 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562757/5416-v15.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestHCM org.apache.hadoop.hbase.replication.TestMasterReplication org.apache.hadoop.hbase.TestHBaseTestingUtility {color:red}-1 core zombie tests{color}. There are 3 zombie test(s): at org.apache.hadoop.hbase.io.encoding.TestUpgradeFromHFileV1ToEncoding.testUpgrade(TestUpgradeFromHFileV1ToEncoding.java:83) Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3780//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540386#comment-13540386 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- @Ted You are correct. My bad. So if {code} if (this.filter == null || !scan.doLoadColumnFamiliesOnDemand() + || this.filter.isFamilyEssential(entry.getKey())) { {code} So even if first two condition fail if isFamilyEssential= true then it will add to scanners. But still if isFamilyEssentail = false then what will happen wrt the stopRow that is added in the latest patches. What you feel? Because the current is always got from the storeHeap. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540388#comment-13540388 ] Ted Yu commented on HBASE-5416: --- I need to go over the logic one more time. But for non-essential column family, I think it is correct too. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540054#comment-13540054 ] Ted Yu commented on HBASE-5416: --- bq. Now in this case i will have only the joinedScanner heap. I think you meant we only have storeHeap. Here is related code from HRegion.nextInternal() of 0.94 branch: {code} if (isStopRow(currentRow, offset, length)) { if (filter != null filter.hasFilterRow()) { filter.filterRow(results); } if (filter != null filter.filterRow()) { results.clear(); } return false; {code} We can see the logic is the same as that in patch v12. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540062#comment-13540062 ] Max Lapan commented on HBASE-5416: -- bq. I think you meant we only have storeHeap. No, exactly one KVS in joinedScanner heap and empty storeHeap. It was caused by {{!scan.doLoadColumnFamiliesOnDemand()}} extra condition in constructor. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540069#comment-13540069 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- Yes Max that's what i meant. Ted, just take the condition where isFamilyEssential = true. Also scan says lazy loading of CF is allowed. (and also only one CF). So storeHeap will be null in this case. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540070#comment-13540070 ] Ted Yu commented on HBASE-5416: --- Here is the change related to Max's comment above: {code} +if (this.filter == null || !scan.doLoadColumnFamiliesOnDemand() + || this.filter.isFamilyEssential(entry.getKey())) { + scanners.add(scanner); {code} Ram said: bq. Also scan says lazy loading of CF is allowed. So the condition {{!scan.doLoadColumnFamiliesOnDemand()}} should be false, right ? Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540078#comment-13540078 ] Ted Yu commented on HBASE-5416: --- Here is code from patch v7: {code} +if (this.filter == null || this.filter.isFamilyEssential(entry.getKey())) { + scanners.add(scanner); {code} In the above case, I don't see difference between v7 and v12 w.r.t. populating scanners List. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540105#comment-13540105 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562493/5416-v13.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster org.apache.hadoop.hbase.replication.TestReplication {color:red}-1 core zombie tests{color}. There are 4 zombie test(s): Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3718//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13539286#comment-13539286 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- I could find one thing here {code} // Let's see what we have in the storeHeap. KeyValue current = this.storeHeap.peek(); boolean stopRow = isStopRow(current); {code} Assume i have one CF and i have a filter with isFamilyEssential = true. Also scan says lazy loading of CF is allowed. Now in this case i will have only the joinedScanner heap. So now when the next() is called current will be null as nothing is there in storeHeap. Inside isStopRow {code} return kv == null || kv.getBuffer() == null || (stopRow != null comparator.compareRows(stopRow, 0, stopRow.length, kv.getBuffer(), kv.getRowOffset(), kv.getRowLength()) = isScan); {code} This will give me true. So stopRow is true. {code} if (stopRow) { if (filter != null filter.hasFilterRow()) { filter.filterRow(results); } return false; } {code} So this code will just return without scanning. Pls correct me if am wrong? Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538236#comment-13538236 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- I will take a look at the updated patch by tomorrow evening. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538272#comment-13538272 ] stack commented on HBASE-5416: -- bq. It appears that I have accidentally fixed something. I can repro with v11 patch but not v12... Smile Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538441#comment-13538441 ] Ted Yu commented on HBASE-5416: --- Looks like ClientProtos.java needs to be regenerated due to recent changes in trunk. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538453#comment-13538453 ] Sergey Shelukhin commented on HBASE-5416: - I replaced v12 patch to do that yesterday evening. Or you mean again? I will regen at next iteration, just in case :) Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538563#comment-13538563 ] Ted Yu commented on HBASE-5416: --- I think the rewritten logic is easier to understand. {code} + public static final String LOAD_CFS_ON_DEMAND_CONFIG_KEY = hbase.hregion.scan.loadColumnFamiliesOnDemand; {code} Wrap long line. {code} + public boolean isLoadingCfsOnDemandDefault() { {code} Can the 'Default' be dropped from the method name ? We're interested in whether on demand loading is on. {code} + ListKeyValueScanner joinedScanners = new ArrayListKeyValueScanner(); {code} Should we check scan.doLoadColumnFamiliesOnDemand() first so that we don't allocate ArrayList if this feature is turned off ? {code} + * Fetches records with this row into result list, until next row or limit (if not -1). {code} 'this row' - 'currentRow' 'result list' - 'results list' {code} +// Check if we were getting data from the joinedHeap abd hit the limit. {code} 'abd' - 'and' {code} + // Techically, if we hit limits before on this row, we don't need this call. {code} Typo: Techically {code} + // Populating from the joined map was stopped by limits, populate some more. {code} 'joined map' - 'joined heap' {code} +// the case when SingleValueExcludeFilter is used. {code} SingleValueExcludeFilter - SingleColumnValueExcludeFilter Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537208#comment-13537208 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- Ok groked up the patch. {code} if (!scan.getAllowLazyCfLoading() || this.filter == null || this.filter.isFamilyEssential(entry.getKey())) { {code} Move the this.filter == null as first condition. Because when you don have filters then the entire joinedHeap is not going to used right? {code} correct_row = this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length)); {code} So here we move on to the KV just before the row we got in the current next() call? After this suppose due to limits it says that joinedHeapHasMoreData =true, now when the next call comes {code} else if (joinedHeapHasMoreData) { joinedHeapHasMoreData = populateResult(this.joinedHeap, limit, currentRow, offset, length, metric); return true; {code} I think we should get the return val from the populateResult and if it returns a false we may need to check if we have reached the stopRow or not right? Filters need not be checked anyway. So one thing is if i say in my Scan that i need LazyLoading but my filter is NOT of the type SCVF and the ones that implement isFamilyEssential then it goes thro normal flow. May be this we need to document clearly as user may think that setting that property is going to give him a better optimized scan. Reg, the TestHRegion testcases. Actually the testcases does not test the behaviour of joinedScanners. Is it intended? But the testcase names suggests it tests joinedScanners. I will leave it to other scan experts in deciding whether this can go in. Overall a very good improvment. Thanks to Max, Sergey and Ted. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537230#comment-13537230 ] ramkrishna.s.vasudevan commented on HBASE-5416: --- @Sergey The split related failure has to be investigated. Will try looking into the possible reason for failure. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537444#comment-13537444 ] Sergey Shelukhin commented on HBASE-5416: - Hmm... I am trying to clean up code a bit now and write comments. It seems that this patch shouldn't work with limits at all... in the big else clause, if we get false from populateResult on storeHeap, we'd go on to start getting stuff from joinedMap. Suppose stopRow is true e.g. storeHeap.peek() now points at the stop-row. Suppose now we hit the limit and set joinedHeapHasMoreData, and return true. On the next call, storeHeap is still pointing to stop-row, so we won't even reach else if (joinedHeapHasMoreData) condition (well, and if we did we'd populate nothing because matchingRow will always return false). Can someone please sanity check me? I'll see how to fix it. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537628#comment-13537628 ] Sergey Shelukhin commented on HBASE-5416: - It appears that I have accidentally fixed something. I can repro with v11 patch but not v12... Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans 30-50 times. Also, this gives us the way to better normalize the data into separate columns by optimizing the scans performed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537631#comment-13537631 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562021/HBASE-5416-v12.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 28 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestMultiParallel org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor org.apache.hadoop.hbase.client.TestFromClientSide org.apache.hadoop.hbase.coprocessor.TestAggregateProtocol {color:red}-1 core zombie tests{color}. There are zombie tests. See build logs for details. Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3641//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners
[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537690#comment-13537690 ] Hadoop QA commented on HBASE-5416: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562033/HBASE-5416-v12.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 28 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication org.apache.hadoop.hbase.util.TestMergeTable {color:red}-1 core zombie tests{color}. There are zombie tests. See build logs for details. Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3644//console This message is automatically generated. Improve performance of scans with some kind of filters. --- Key: HBASE-5416 URL: https://issues.apache.org/jira/browse/HBASE-5416 Project: HBase Issue Type: Improvement Components: Filters, Performance, regionserver Affects Versions: 0.90.4 Reporter: Max Lapan Assignee: Sergey Shelukhin Fix For: 0.96.0 Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch When the scan is performed, whole row is loaded into result list, after that filter (if exists) is applied to detect that row is needed. But when scan is performed on several CFs and filter checks only data from the subset of these CFs, data from CFs, not checked by a filter is not needed on a filter stage. Only when we decided to include current row. And in such case we can significantly reduce amount of IO performed by a scan, by loading only values, actually checked by a filter. For example, we have two CFs: flags and snap. Flags is quite small (bunch of megabytes) and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite costly to scan it. If we needed only rows with some flag specified, we use SingleColumnValueFilter to limit result to only small subset of region. But current implementation is loading both CFs to perform scan, when only small subset is needed. Attached patch adds one routine to Filter interface to allow filter to specify which CF is needed to it's operation. In HRegion, we separate all scanners into two groups: needed for filter and the rest (joined). When new row is considered, only needed data is loaded, filter applied, and only if filter accepts the row, rest of data is loaded. At our data, this speeds up such kind of scans