[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247444#comment-13247444 ] Eran Kutner commented on HBASE-3996: @stack: I believe the only open issue in the review board is your suggestion to replace my MultiTableInputCollection with a List<Scan>. Although I agree it would make the patch simpler and allow it to have one less class, I think it will make using it less natural. Developers will have to create a Scan, which is a common object, and then set a table attribute. This feels less natural to me than setting the table by adding to a collection the way I've done it, but I guess it's a matter of perspective. Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Assignee: Eran Kutner Fix For: 0.96.0 Attachments: 3996-v2.txt, 3996-v3.txt, 3996-v4.txt, 3996-v5.txt, 3996-v6.txt, 3996-v7.txt, HBase-3996.patch It seems that in many cases feeding data from multiple tables or multiple scanners on a single table can save a lot of time when running map/reduce jobs. I propose a new MultiTableInputFormat class that would allow doing this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247448#comment-13247448 ] Eran Kutner commented on HBASE-3996: Just to give better reasoning for why I feel it is unnatural: with my method, someone using this functionality for the first time would be able to figure it out just by looking at the class names and interface definitions (using IDE auto-completion, for example), whereas with the Scan attribute approach the only way to know that the attribute must be set is to dig through the documentation. Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Assignee: Eran Kutner Fix For: 0.96.0 Attachments: 3996-v2.txt, 3996-v3.txt, 3996-v4.txt, 3996-v5.txt, 3996-v6.txt, 3996-v7.txt, HBase-3996.patch
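The two API shapes debated in the comments above can be contrasted with a small sketch. Everything here is a hypothetical stand-in written for illustration: ScanSpec, TableScanInputs, InputStyles, and the attribute key "example.table.name" are assumptions, not HBase classes or constants, and neither style is claimed to be what the patch implements.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; shows both styles side by side.
final class InputStyles {
    // Style A: the table travels as an attribute on the scan.
    // A missing attribute is only noticed when someone looks it up.
    static String tableOf(ScanSpec scan) {
        String table = scan.getAttribute(ScanSpec.TABLE_ATTR);
        if (table == null) {
            throw new IllegalStateException("scan has no table attribute");
        }
        return table;
    }

    public static void main(String[] args) {
        // Style A: a plain list of scans, each tagged with its table.
        List<ScanSpec> scans = new ArrayList<>();
        scans.add(new ScanSpec().setAttribute(ScanSpec.TABLE_ATTR, "tableA"));
        scans.add(new ScanSpec().setAttribute(ScanSpec.TABLE_ATTR, "tableB"));
        for (ScanSpec s : scans) {
            System.out.println("attribute style: " + tableOf(s));
        }

        // Style B: the table is a required argument of add().
        TableScanInputs inputs = new TableScanInputs()
                .add("tableA", new ScanSpec())
                .add("tableB", new ScanSpec());
        for (TableScanInputs.Entry e : inputs.entries()) {
            System.out.println("collection style: " + e.table);
        }
    }
}

// Hypothetical stand-in for a scan definition; NOT the HBase Scan class.
final class ScanSpec {
    // Made-up attribute key, used only in this illustration.
    static final String TABLE_ATTR = "example.table.name";
    private final Map<String, String> attributes = new HashMap<>();

    ScanSpec setAttribute(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    String getAttribute(String key) {
        return attributes.get(key);
    }
}

// Hypothetical collection that pairs each scan with its table explicitly.
final class TableScanInputs {
    static final class Entry {
        final String table;
        final ScanSpec scan;

        Entry(String table, ScanSpec scan) {
            this.table = table;
            this.scan = scan;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    TableScanInputs add(String table, ScanSpec scan) {
        if (table == null || table.isEmpty()) {
            throw new IllegalArgumentException("table name is required");
        }
        entries.add(new Entry(table, scan));
        return this;
    }

    List<Entry> entries() {
        return entries;
    }
}
```

In this sketch the attribute style needs one class fewer, while the collection style surfaces the table requirement in the method signature, which is the trade-off the two comments describe.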
[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240335#comment-13240335 ] Eran Kutner commented on HBASE-3996: There is one pending change I know about, and that is making TableInputConf a static inner class. As for versioning, I'll look at it but can't say when. Other than that, I'm waiting to hear back from @Lars regarding my response to his suggestions on reusing TableInputFormatBase. Sorry for being slow to respond. I'm very busy with other things these days, so feel free to make any changes you feel are right. Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Assignee: Eran Kutner Fix For: 0.96.0 Attachments: 3996-v2.txt, 3996-v3.txt, 3996-v4.txt, 3996-v5.txt, HBase-3996.patch
[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235029#comment-13235029 ] Eran Kutner commented on HBASE-3996: Made some changes following @stack's review. I don't know how to submit for review again. Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Assignee: Eran Kutner Fix For: 0.96.0 Attachments: 3996-v2.txt, 3996-v3.txt, 3996-v4.txt, 3996-v5.txt, HBase-3996.patch
[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233274#comment-13233274 ] Eran Kutner commented on HBASE-3996: Sorry for missing all the action, I was offline for a couple of days. Thanks Ted and everyone else for pushing this forward. Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Assignee: Eran Kutner Fix For: 0.94.0, 0.96.0 Attachments: 3996-v2.txt, 3996-v3.txt, 3996-v4.txt, HBase-3996.patch
[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209228#comment-13209228 ] Eran Kutner commented on HBASE-3996: It was merging fine when I posted it about 7 months ago. I assume a lot has changed in TRUNK since. I'll take a look at it but can't promise an ETA. Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Fix For: 0.94.0 Attachments: MultiTableInputFormat.patch, TestMultiTableInputFormat.java.patch
[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209310#comment-13209310 ] Eran Kutner commented on HBASE-3996: I now remember this was a patch file I tried to manipulate manually to remove some extra stuff that was included and that Stack didn't like. I regenerated the patch file from TRUNK, but it still has some unnecessary stuff in it. Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Fix For: 0.94.0 Attachments: HBase-3996.patch, MultiTableInputFormat.patch, TestMultiTableInputFormat.java.patch
[jira] [Commented] (HBASE-4612) Allow ColumnPrefixFilter to support multiple prefixes
[ https://issues.apache.org/jira/browse/HBASE-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133586#comment-13133586 ] Eran Kutner commented on HBASE-4612: OK, I uploaded a patch for trunk; hopefully what I've done with the createFilterFromArguments method makes sense. Allow ColumnPrefixFilter to support multiple prefixes - Key: HBASE-4612 URL: https://issues.apache.org/jira/browse/HBASE-4612 Project: HBase Issue Type: Improvement Components: filters Affects Versions: 0.90.4 Reporter: Eran Kutner Assignee: Eran Kutner Priority: Minor Fix For: 0.94.0 Attachments: HBASE-4612-0.90.patch, HBASE-4612.patch With a lot of columns grouped by name, I've found that it would be very useful to be able to scan them using multiple prefixes, which makes it possible to fetch specific groups in one scan without fetching the entire row. This is impossible to achieve using a FilterList, so I've added such support to the existing ColumnPrefixFilter while keeping backward compatibility. The attached patch is based on 0.90.4. I noticed that the 0.92 branch has a new method to support instantiating filters using Thrift. I'm not sure how the serialization works there so I didn't implement that, but the rest of my code should work in 0.92 as well.
[jira] [Commented] (HBASE-4612) Allow ColumnPrefixFilter to support multiple prefixes
[ https://issues.apache.org/jira/browse/HBASE-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130020#comment-13130020 ] Eran Kutner commented on HBASE-4612: Hi Jonathan, thanks for the feedback! See answers inline:
{quote}There's no explanation of the behavior anywhere. In the constructors and addPrefix() methods, you should document that this creates an OR condition across all of the prefixes, correct?{quote}
- Good point, added some more explanations.
{quote}No need to instantiate a new comparator all the time (use Bytes.BYTES_COMPARATOR){quote}
- Didn't know it existed. Changed.
{quote}Something seems odd when you keep adding to the end of a List and then sort. How about a TreeSet? You can easily ignore dupes that way.{quote}
- This is intentional. Sorting is done only during initialization, but accessing an ArrayList, which is actually backed by an array, is much more efficient than accessing a tree, so I sacrifice the aesthetics of the code for better runtime performance.
{quote}There's no input verification so, for example, you could pass a null to the constructor or an empty byte[][] and have some strange behavior. Like it will instantiate okay but then you'll get server-side NPEs or IOOB.{quote}
- It's a good point, but I've looked and no other filter validates its input either. I can throw an IllegalArgumentException but don't know if it's a good idea considering it's not the norm.
{quote}this.prefixes.size() == 0 -> this.prefixes.isEmpty(){quote}
- OK, changed.
{quote}your comment at the top of filterColumn, i wouldn't exactly call it a workaround, but it's a good comment. looking at the logic, it seems like correct behavior would be that it can be called with current == size() but it would be a bug if current > size(), right? should you add an assert or throw an exception?{quote}
- Well, it is kind of a workaround, because as an individual filter I expect not to be called again after returning NEXT_ROW. However, when used with FilterList the filter does get called again, which puts it in an illegal state, so it has to handle that case explicitly. That is also why it can't throw an exception in that scenario: it seems to happen normally when used with FilterList. As for current, it has to be smaller than size() or it would be outside the bounds of the array.
Allow ColumnPrefixFilter to support multiple prefixes - Key: HBASE-4612 URL: https://issues.apache.org/jira/browse/HBASE-4612 Project: HBase Issue Type: Improvement Components: filters Affects Versions: 0.90.4 Reporter: Eran Kutner Assignee: Eran Kutner Priority: Minor Fix For: 0.94.0 Attachments: HBASE-4612-0.90.patch
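The points discussed in the reply above (OR semantics across prefixes, a list sorted once at construction, input validation, and tolerating another call after NEXT_ROW) can be sketched as follows. This is a minimal illustration, not the code from the attached patch: MultiPrefixMatcher, Verdict, check(), and reset() are hypothetical names, the sketch assumes qualifiers arrive in ascending unsigned byte order within a row, and it uses its own comparator instead of Bytes.BYTES_COMPARATOR so that it has no HBase dependency.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the multi-prefix logic; NOT the HBase filter class.
final class MultiPrefixMatcher {
    // Stand-in verdicts; the names are assumptions made for this sketch.
    enum Verdict { INCLUDE, SKIP, NEXT_ROW }

    // Unsigned lexicographic order over byte arrays.
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int c = compareCommon(a, b);
        return c != 0 ? c : Integer.compare(a.length, b.length);
    };

    private final List<byte[]> prefixes = new ArrayList<>();
    private int current = 0; // first prefix that can still match in this row

    MultiPrefixMatcher(byte[]... input) {
        // Validate up front instead of failing later with NPE or index errors.
        if (input == null || input.length == 0) {
            throw new IllegalArgumentException("at least one prefix is required");
        }
        for (byte[] p : input) {
            if (p == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            prefixes.add(p.clone());
        }
        prefixes.sort(UNSIGNED_LEX); // sorted once, at construction
    }

    // Call at the start of each row.
    void reset() {
        current = 0;
    }

    // A qualifier is included if it starts with ANY prefix (OR condition).
    Verdict check(byte[] qualifier) {
        while (current < prefixes.size()) {
            byte[] p = prefixes.get(current);
            int c = compareCommon(qualifier, p);
            if (c == 0 && qualifier.length >= p.length) {
                return Verdict.INCLUDE; // qualifier starts with this prefix
            }
            if (c <= 0) {
                return Verdict.SKIP; // qualifier sorts before this prefix
            }
            current++; // qualifier is past this prefix; try the next one
        }
        // current == size(): no prefix left. A repeated call lands here
        // again and gets the same answer instead of an exception.
        return Verdict.NEXT_ROW;
    }

    // Compares only the bytes both arrays have in common.
    private static int compareCommon(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MultiPrefixMatcher m = new MultiPrefixMatcher(bytes("b"), bytes("ab"));
        for (String q : new String[] {"a", "abc", "ac", "b1", "c", "d"}) {
            System.out.println(q + " -> " + m.check(bytes(q)));
        }
    }
}
```

Because the loop condition guards every list access, current never indexes past the end of the list, and reaching current == size() is treated as a normal terminal state rather than an error, which mirrors the FilterList behavior described in the last answer.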
[jira] [Commented] (HBASE-4612) Allow ColumnPrefixFilter to support multiple prefixes
[ https://issues.apache.org/jira/browse/HBASE-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130023#comment-13130023 ] Eran Kutner commented on HBASE-4612: @Ted: {quote}Improvements go to TRUNK.{quote} I know, but see my initial comment regarding the new Thrift initialization method. I'm just not sure how it's supposed to work or what I am supposed to do there. Allow ColumnPrefixFilter to support multiple prefixes - Key: HBASE-4612 URL: https://issues.apache.org/jira/browse/HBASE-4612 Project: HBase Issue Type: Improvement Components: filters Affects Versions: 0.90.4 Reporter: Eran Kutner Assignee: Eran Kutner Priority: Minor Fix For: 0.94.0 Attachments: HBASE-4612-0.90.patch