Re: [jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support

2010-08-06 Thread Gerrit van Vuuren
Hi,
Could you paste the error messages? I've run this locally and it works. I'll 
try and do this on a different machine to see what's wrong.


- Original Message -
From: Daniel Dai (JIRA) j...@apache.org
To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org
Sent: Fri Aug 06 19:48:16 2010
Subject: [jira] Commented: (PIG-1526) HiveColumnarLoader Partitioning Support


[ 
https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896117#action_12896117
 ] 

Daniel Dai commented on PIG-1526:
-

Hi, Gerrit, 
Piggybank tests TestHiveColumnarLoader, TestPathPartitionHelper and 
TestPathPartitioner fail. Can you take a look? I will temporarily drop these test 
cases from trunk until they are fixed.

Thanks

 HiveColumnarLoader Partitioning Support
 ---

 Key: PIG-1526
 URL: https://issues.apache.org/jira/browse/PIG-1526
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1526-2.patch, PIG-1526.patch


 I've made a lot of improvements to the HiveColumnarLoader:
 - Added support for LoadMetadata and data path partitioning 
 - Improved and simplified column loading
 Data Path Partitioning:
 Hive stores partitions as folders, e.g. 
 /mytable/partition1=[value]/partition2=[value]. That is, the table mytable 
 contains 2 partitions [partition1, partition2].
 The HiveColumnarLoader will scan the input path /mytable and add to the 
 Pig schema the columns partition1 and partition2. 
 These columns can then be used in filtering. 
 For example: We've got year,month,day,hour partitions in our data uploads.
 So a table might look like mytable/year=2010/month=02/day=01.
 Loading with the HiveColumnarLoader allows our Pig scripts to filter by date 
 using the standard Pig Filter operator.
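The path-scanning step described above can be sketched in plain Java. This is an illustrative parser only, not the actual PathPartitioner code from the patch; the class and method names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: extract Hive-style partition columns from a path
// such as /mytable/year=2010/month=02/day=01. Each "key=value" path
// segment becomes a partition column that can be exposed in the schema.
public class PartitionPathParser {
    public static Map<String, String> parse(String path) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                partitions.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        // prints {year=2010, month=02, day=01}
        System.out.println(parse("/mytable/year=2010/month=02/day=01"));
    }
}
```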
 I've added 2 classes for this:
 - PathPartitioner
 - PathPartitionHelper
 These classes are not hive dependent and could be used by any other loader 
 that wants to support partitioning and helps with implementing the 
 LoadMetadata interface.
 For this reason I thought it best to put them into the package 
 org.apache.pig.piggybank.storage.partition.
 What would be nice in the future is to have PigStorage also use these 2 
 classes to provide automatic path partitioning support. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: COMPLETED merge of load-store-redesign branch to trunk

2010-02-19 Thread Gerrit van Vuuren
Great stuff guys,

I've been keen on refactoring the Pig HiveRCLoader reader and writer to use the 
new load-store redesign.

 

- Original Message -
From: Pradeep Kamath prade...@yahoo-inc.com
To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org; 
pig-u...@hadoop.apache.org pig-u...@hadoop.apache.org
Sent: Fri Feb 19 20:05:54 2010
Subject: COMPLETED merge of load-store-redesign branch to trunk 

The merge from load-store-redesign branch to trunk is now completed. New
commits can now proceed on trunk. The load-store-redesign branch is
deprecated with this merge and no more commits should be done on that
branch.

 

Pradeep

 



From: Pradeep Kamath 
Sent: Thursday, February 18, 2010 11:20 AM
To: Pradeep Kamath; 'pig-dev@hadoop.apache.org';
'pig-u...@hadoop.apache.org'
Subject: BEGINNING merge of load-store-redesign branch to trunk - hold
off commits!

 

Hi,

  I will begin this activity now - a request to all committers to not
commit to trunk or load-store-redesign till I send an all clear message
- I am anticipating this will hopefully be completed by end of day
(Pacific time) tomorrow.

 

Thanks,

Pradeep

 



From: Pradeep Kamath 
Sent: Tuesday, February 16, 2010 11:34 AM
To: 'pig-dev@hadoop.apache.org'; 'pig-u...@hadoop.apache.org'
Subject: Plan to merge load-store-redesign branch to trunk

 

Hi,

   We would like to merge the load-store-redesign branch to trunk
tentatively on Thursday. To do this, I would like to request all
committers to not commit anything to load-store-redesign branch or trunk
during the period of the merge. I will send out a mail to indicate begin
and end of this activity - tentatively I am expecting this to be a day's
period, from 9 AM PST Thursday to 9 AM PST Friday, so I can resolve any
conflicts and run all tests.

 

Pradeep

 



RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-07 Thread Gerrit van Vuuren
Hi,

I would like to extend the HiveColumnarRC reader in such a way that it can tell 
Pig to only use a certain group of files, i.e. I want to filter the files and 
have Pig only use these for calculating the number of tasks to run. I'd 
appreciate it if anybody can point me in the right direction.

Cheers,
 Gerrit

-Original Message-
From: Gerrit Jansen van Vuuren (JIRA) [mailto:j...@apache.org] 
Sent: 03 December 2009 16:03
To: pig-dev@hadoop.apache.org
Subject: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables


 [ 
https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1117:
--

Attachment: HiveColumnarLoaderTest.patch
HiveColumnarLoader.patch

Pig Storage Loader for reading from HiveColumnarRC Files

 Pig reading hive columnar rc tables
 ---

 Key: PIG-1117
 URL: https://issues.apache.org/jira/browse/PIG-1117
 Project: Pig
  Issue Type: New Feature
Reporter: Gerrit Jansen van Vuuren
 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch


 I've coded a LoadFunc implementation that can read from Hive Columnar RC 
 tables. This is needed for a project that I'm working on because all our data 
 is stored using the Hive thrift-serialized Columnar RC format. I have looked 
 at the piggybank but did not find any implementation that could do this. 
 We've been running it on our cluster for the last week and have worked out 
 most bugs.
  
 There are still some improvements I would need, like setting 
 the number of mappers based on date partitioning. It's been optimized so as to 
 read only specific columns and can churn through a data set almost 8 times 
 faster with this improvement because not all column data is read.
 I would like to contribute the class to the piggybank; can you guide me on 
 what I need to do?
 I've used Hive-specific classes to implement this; is it possible to add this 
 to the piggybank Ivy build for automatic download of the dependencies?




RE: [jira] Updated: (PIG-1117) Pig reading hive columnar rc tables

2009-12-03 Thread Gerrit van Vuuren
Hi,

I've made 2 patches: one for the loader and another for the unit test.
It's not perfect yet but at least this way people can start testing it and give 
some input.

How do I submit the patch? I tried the SubmitPatch link but could not attach 
the actual patch, so I just ended up attaching it as a file.

Note that to run this you'll need the hive_exec.jar from Hive: 
http://svn.apache.org/repos/asf/hadoop/hive/trunk/ql/

Any help on how to integrate this with the ant build.xml will be appreciated.
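For the Ivy integration asked about above, a dependency entry could in principle look like the fragment below. This is a hypothetical sketch only: the org, name, and rev coordinates are illustrative and assume the Hive artifacts are published in a repository the build can resolve, which the source does not confirm:

```xml
<!-- hypothetical ivy.xml fragment; org/name/rev are illustrative -->
<dependency org="org.apache.hadoop.hive" name="hive-exec" rev="0.4.0"
            conf="compile->default"/>
```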

Cheers,
 Gerrit


Pig reading hive columnar rc tables

2009-11-30 Thread Gerrit van Vuuren
Hi,

 

I've coded a LoadFunc implementation that can read from Hive Columnar RC
tables. This is needed for a project that I'm working on because all our
data is stored using the Hive thrift-serialized Columnar RC format. I
have looked at the piggybank but did not find any implementation that
could do this. We've been running it on our cluster for the last week
and have worked out most bugs.

 

There are still some improvements I would need, like
setting the number of mappers based on date partitioning. It's been
optimized so as to read only specific columns and can churn through a
data set almost 8 times faster with this improvement because not all
column data is read.
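The speedup from reading only specific columns follows directly from the columnar layout: data for each column is stored contiguously, so unrequested columns can be skipped entirely. A toy model of the idea (not RCFile's actual on-disk format or API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of columnar reading (not RCFile's real implementation):
// each column's values are stored together, so a reader that is told
// which columns are wanted never touches the rest.
public class ColumnarReadSketch {
    private final Map<String, String[]> columns = new HashMap<>();

    public void putColumn(String name, String[] values) {
        columns.put(name, values);
    }

    // Only the requested columns are materialized; skipping the others
    // is where the speedup over row-at-a-time reading comes from.
    public String[] readRow(int row, String... wanted) {
        String[] out = new String[wanted.length];
        for (int i = 0; i < wanted.length; i++) {
            out[i] = columns.get(wanted[i])[row];
        }
        return out;
    }
}
```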

 

I would like to contribute the class to the piggybank; can you guide me
in what I need to do?

I've used Hive-specific classes to implement this; is it possible to add
this to the piggybank Ivy build for automatic download of the
dependencies?

 

Thanks,

 Gerrit Jansen van Vuuren