Re: [jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support
Hi,

Could you paste the error messages? I've run this locally and it works. I'll try to do this on a different machine to see what's wrong.

----- Original Message -----
From: Daniel Dai (JIRA) j...@apache.org
To: pig-dev@hadoop.apache.org
Sent: Fri Aug 06 19:48:16 2010
Subject: [jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

[ https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896117#action_12896117 ]

Daniel Dai commented on PIG-1526:
---------------------------------

Hi, Gerrit,
The Piggybank tests TestHiveColumnarLoader, TestPathPartitionHelper and TestPathPartitioner fail. Can you take a look? I will temporarily drop these test cases from trunk until they are fixed. Thanks.

HiveColumnarLoader Partitioning Support
---------------------------------------

Key: PIG-1526
URL: https://issues.apache.org/jira/browse/PIG-1526
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Priority: Minor
Fix For: 0.8.0
Attachments: PIG-1526-2.patch, PIG-1526.patch

I've made a lot of improvements to the HiveColumnarLoader:
- Added support for LoadMetadata and data path partitioning
- Improved and simplified column loading

Data Path Partitioning:
Hive stores partitions as folders like /mytable/partition1=[value]/partition2=[value]; that is, the table mytable contains 2 partitions [partition1, partition2]. The HiveColumnarLoader will scan the input path /mytable and add the columns partition1 and partition2 to the Pig schema. These columns can then be used in filtering.

For example: we've got year, month, day and hour partitions in our data uploads, so a table might look like mytable/year=2010/month=02/day=01. Loading with the HiveColumnarLoader allows our Pig scripts to filter by date using the standard Pig FILTER operator.
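The partition filtering described above might look roughly like this in a Pig script (a sketch only; the constructor argument shown for the loader is illustrative, not its exact signature, and the field names are assumed):

```pig
-- load the table; year/month/day partition columns are added to the schema
-- by scanning the directory layout under /mytable
raw = LOAD '/mytable'
      USING org.apache.pig.piggybank.storage.HiveColumnarLoader('f1 string,f2 int');

-- filter on the partition columns just like ordinary fields
recent = FILTER raw BY year == '2010' AND month == '02';
DUMP recent;
```

With LoadMetadata support, a filter like this on partition columns can be pushed down so that non-matching directories are never read at all.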
I've added 2 classes for this:
- PathPartitioner
- PathPartitionHelper

These classes are not Hive-dependent and could be used by any other loader that wants to support partitioning; they also help with implementing the LoadMetadata interface. For this reason I thought it best to put them into the package org.apache.pig.piggybank.storage.partition. What would be nice in the future is to have PigStorage also use these 2 classes to provide automatic path partitioning support.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
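The core job of a path partitioner like the one described above can be sketched in plain Java (the class and method names here are hypothetical, not the actual PathPartitioner API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PathPartitionSketch {

    // Extract key=value partition segments from a path such as
    // /mytable/year=2010/month=02/day=01/part-00000, preserving their order
    // so they can be appended to the Pig schema in directory order.
    public static Map<String, String> partitionKeys(String path) {
        Map<String, String> keys = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                keys.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(partitionKeys("/mytable/year=2010/month=02/day=01/part-00000"));
        // {year=2010, month=02, day=01}
    }
}
```

Each key found this way becomes an extra column whose value is constant for every row read from files under that directory.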
Re: COMPLETED merge of load-store-redesign branch to trunk
Great stuff guys. I've been keen on refactoring the Pig HiveRCLoader reader and writer to use the new load-store redesign.

----- Original Message -----
From: Pradeep Kamath prade...@yahoo-inc.com
To: pig-dev@hadoop.apache.org; pig-u...@hadoop.apache.org
Sent: Fri Feb 19 20:05:54 2010
Subject: COMPLETED merge of load-store-redesign branch to trunk

The merge from the load-store-redesign branch to trunk is now completed. New commits can now proceed on trunk. The load-store-redesign branch is deprecated with this merge and no more commits should be done on that branch.

Pradeep

From: Pradeep Kamath
Sent: Thursday, February 18, 2010 11:20 AM
To: Pradeep Kamath; 'pig-dev@hadoop.apache.org'; 'pig-u...@hadoop.apache.org'
Subject: BEGINNING merge of load-store-redesign branch to trunk - hold off commits!

Hi,
I will begin this activity now - a request to all committers not to commit to trunk or load-store-redesign until I send an all-clear message. I am anticipating this will hopefully be completed by end of day (Pacific time) tomorrow.
Thanks,
Pradeep

From: Pradeep Kamath
Sent: Tuesday, February 16, 2010 11:34 AM
To: 'pig-dev@hadoop.apache.org'; 'pig-u...@hadoop.apache.org'
Subject: Plan to merge load-store-redesign branch to trunk

Hi,
We would like to merge the load-store-redesign branch to trunk, tentatively on Thursday. To do this, I would like to request all committers not to commit anything to the load-store-redesign branch or trunk during the period of the merge. I will send out a mail to indicate the begin and end of this activity - tentatively I am expecting this to be a day's period between 9 AM PST Thursday and 9 AM PST Friday so I can resolve any conflicts and run all tests.
Pradeep
RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
Hi,

I would like to extend the HiveColumnarRC reader in such a way that it can tell Pig to only use a certain group of files, i.e. I want to filter the files and have Pig only use these for calculating the number of tasks to run. I'd appreciate it if anybody can point me in the right direction.

Cheers,
Gerrit

-----Original Message-----
From: Gerrit Jansen van Vuuren (JIRA) [mailto:j...@apache.org]
Sent: 03 December 2009 16:03
To: pig-dev@hadoop.apache.org
Subject: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1117:
------------------------------------------

Attachment: HiveColumnarLoaderTest.patch
            HiveColumnarLoader.patch

Pig Storage Loader for reading from HiveColumnarRC files

Pig reading hive columnar rc tables
-----------------------------------

Key: PIG-1117
URL: https://issues.apache.org/jira/browse/PIG-1117
Project: Pig
Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch

I've coded a LoadFunc implementation that can read from Hive Columnar RC tables; this is needed for a project I'm working on because all our data is stored using the Hive thrift-serialized Columnar RC format. I have looked at the Piggybank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs.

There are still some improvements to be done that I would need, like setting the number of mappers based on date partitioning. It has been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read.

I would like to contribute the class to the Piggybank; can you guide me in what I need to do? I've used Hive-specific classes to implement this - is it possible to add these to the Piggybank build ivy for automatic download of the dependencies?
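One way to restrict which files Pig sees, sketched below, is to express the restriction as a plain predicate over the path string; in a real loader such a predicate would back a org.apache.hadoop.fs.PathFilter registered on the input format, so that splits (and hence map tasks) are only created for matching files. All names here are illustrative assumptions, not the eventual PathPartitionHelper API:

```java
// Sketch: accept only files whose year=/month=/day= path segments fall on or
// after a cutoff date. Kept Hadoop-free here so the logic stands alone; in a
// loader this method would be called from a PathFilter.accept(Path) override.
public class DatePathFilterSketch {

    // path:   e.g. /mytable/year=2010/month=02/day=01/part-00000
    // cutoff: e.g. "2010-02-01" (yyyy-MM-dd)
    public static boolean acceptOnOrAfter(String path, String cutoff) {
        String y = segmentValue(path, "year");
        String m = segmentValue(path, "month");
        String d = segmentValue(path, "day");
        if (y == null || m == null || d == null) {
            return false; // not a date-partitioned file
        }
        // Lexicographic comparison is correct because the fields are zero-padded.
        return (y + "-" + m + "-" + d).compareTo(cutoff) >= 0;
    }

    private static String segmentValue(String path, String key) {
        for (String segment : path.split("/")) {
            if (segment.startsWith(key + "=")) {
                return segment.substring(key.length() + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(acceptOnOrAfter("/mytable/year=2010/month=02/day=01/p0", "2010-01-15")); // true
        System.out.println(acceptOnOrAfter("/mytable/year=2009/month=12/day=31/p0", "2010-01-15")); // false
    }
}
```

Since the number of map tasks is derived from the splits, filtering the files before split calculation is exactly what controls how many tasks run.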
RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables
Hi,

I've made 2 patches: one for the loader and another for the unit test. It's not perfect yet, but at least this way people can start testing it and give some input.

How do I submit the patch? I tried the Submit Patch link but could not attach the actual patch, so I just ended up attaching it as a file.

Note that to run this you'll need the hive_exec.jar from Hive: http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/

Any help on how to integrate this with the ant build.xml will be appreciated.

Cheers,
Gerrit
Pig reading hive columnar rc tables
Hi,

I've coded a LoadFunc implementation that can read from Hive Columnar RC tables; this is needed for a project I'm working on because all our data is stored using the Hive thrift-serialized Columnar RC format. I have looked at the Piggybank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs.

There are still some improvements to be done that I would need, like setting the number of mappers based on date partitioning. It has been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read.

I would like to contribute the class to the Piggybank; can you guide me in what I need to do? I've used Hive-specific classes to implement this - is it possible to add these to the Piggybank build ivy for automatic download of the dependencies?

Thanks,
Gerrit Jansen van Vuuren