Re: Pig reading hive columnar rc tables

2009-11-30 Thread Dmitriy Ryaboy
I retract the suggestion :).
How would we do testing/building for it in piggybank? Not include it in the
compile and test targets, and instead set up separate compile-rcstore and
test-rcstore targets?

-D
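The separate targets Dmitriy is asking about might look roughly like this in piggybank's build.xml. This is a hypothetical sketch, not an actual patch: the target names come from his email, but the source paths, classpath refs, and property names are invented for illustration.

```xml
<!-- build.xml fragment (hypothetical): opt-in targets, so the default
     compile/test targets never touch the hive-dependent sources -->
<target name="compile-rcstore" depends="compile">
  <javac srcdir="src/main/java/org/apache/pig/piggybank/storage/hiverc"
         destdir="${build.classes}" classpathref="rcstore.classpath"/>
</target>

<target name="test-rcstore" depends="compile-rcstore">
  <junit haltonfailure="true">
    <classpath refid="rcstore.classpath"/>
    <batchtest todir="${test.logs}">
      <fileset dir="${test.classes}" includes="**/TestHiveColumnarLoader*"/>
    </batchtest>
  </junit>
</target>
```

With this layout, `ant jar` and `ant test` stay hive-free, and only someone who runs `ant test-rcstore` needs the hive jars on the classpath.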




RE: Pig reading hive columnar rc tables

2009-11-30 Thread Olga Natkovich
+1 on what Alan is saying. I think it would be overkill to have
another contrib for this.

Olga



Re: Pig reading hive columnar rc tables

2009-11-30 Thread Alan Gates


On Nov 30, 2009, at 12:18 PM, Dmitriy Ryaboy wrote:

> That's awesome, I've been itching to do that but never got around to it.
> Garrit, do you have any benchmarks on read speeds?
>
> I don't know about putting this in piggybank, as it carries with it pretty
> significant dependencies, increasing the size of the jar and making it
> difficult for users who don't need it to build piggybank in the first place.
> We might want to consider some other contrib for it -- maybe a "misc"
> contrib that would have individual ant targets for these kinds of
> compatibility submissions?


Does it have to increase the size of the piggybank jar?  Instead of
including hive in our piggybank jar, which I agree would be bad, can
we just say that if you want to use this function you need to provide
the appropriate hive jar yourself?  This way we could use ivy to pull
the jars and build piggybank.

I'm not really wild about creating a new section of contrib just for
functions that have heavier-weight requirements.

Alan.
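Alan's suggestion could be sketched in piggybank's ivy.xml as a dependency that is resolved at build time but never bundled into the jar. The module coordinates, revision, and configuration names below are illustrative assumptions, not the actual piggybank configuration:

```xml
<!-- ivy.xml (illustrative): hive on the compile classpath only,
     never packaged into piggybank.jar -->
<ivy-module version="2.0">
  <info organisation="org.apache.pig" module="piggybank"/>
  <configurations>
    <conf name="compile" visibility="private"/>
  </configurations>
  <dependencies>
    <!-- users register their own matching hive jar at run time -->
    <dependency org="org.apache.hadoop.hive" name="hive-exec"
                rev="0.4.0" conf="compile->default"/>
  </dependencies>
</ivy-module>
```

At run time the user would then supply the matching hive jar alongside piggybank.jar rather than getting it bundled.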






Re: Pig reading hive columnar rc tables

2009-11-30 Thread Yongqiang He
Hi Gerrit Jansen van Vuuren,

You can first open a JIRA on Pig, and people can discuss it there. Create an
account if you do not have one, and create an issue on
https://issues.apache.org/jira/browse/pig.

Thanks,
Yongqiang



Re: Pig reading hive columnar rc tables

2009-11-30 Thread Dmitriy Ryaboy
Sorry I misspelled your name, Gerrit.

-D



Re: Pig reading hive columnar rc tables

2009-11-30 Thread Dmitriy Ryaboy
That's awesome, I've been itching to do that but never got around to it.
Garrit, do you have any benchmarks on read speeds?

I don't know about putting this in piggybank, as it carries with it pretty
significant dependencies, increasing the size of the jar and making it
difficult for users who don't need it to build piggybank in the first place.
We might want to consider some other contrib for it -- maybe a "misc"
contrib that would have individual ant targets for these kinds of
compatibility submissions?

-D




RE: Pig reading hive columnar rc tables

2009-11-30 Thread Olga Natkovich
Hi Garrit,

It would be great if you could contribute the code. The process is
pretty simple:

- Open a JIRA that describes what the loader does and states that you would
like to contribute it to the Piggybank.
- Submit the patch that contains the loader. Make sure it has unit tests
and javadoc.

Once this is done, one of the committers will review and commit the patch.

More details on how to contribute are in
http://wiki.apache.org/pig/PiggyBank.

Olga

-Original Message-
From: Gerrit van Vuuren [mailto:gvanvuu...@specificmedia.com] 
Sent: Friday, November 27, 2009 2:42 AM
To: pig-dev@hadoop.apache.org
Subject: Pig reading hive columnar rc tables

Hi,

 

I've coded a LoadFunc implementation that can read from Hive Columnar RC
tables; this is needed for a project that I'm working on because all our
data is stored using the Hive thrift-serialized Columnar RC format. I
have looked at the piggybank but did not find any implementation that
could do this. We've been running it on our cluster for the last week
and have worked out most bugs.

 

There are still some improvements to be done, like setting the number of
mappers based on date partitioning. It's been optimized to read only
specific columns, and it can churn through a data set almost 8 times
faster with this improvement because not all column data is read.

 

I would like to contribute the class to the piggybank; can you guide me
through what I need to do?

I've used hive-specific classes to implement this; is it possible to add
this to the piggybank build ivy for automatic download of the
dependencies?

 

Thanks,

 Gerrit Jansen van Vuuren
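Gerrit's ~8x speedup comes from column projection: the RCFile reader never decompresses or deserializes columns that were not requested. A rough, untested sketch of that read path against Hive's columnar API follows; the class names are from Hive's `ql.io` and `serde2` packages as best understood, but the exact method signatures, the file path, and the chosen column ids are assumptions, not code from Gerrit's loader.

```java
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFile;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.io.LongWritable;

public class RcProjectionSketch {
  // Untested sketch: read only columns 0 and 3 of an RCFile. Skipped
  // columns are never decompressed, which is where the speedup comes from.
  static void readProjected(Path file) throws Exception {
    Configuration conf = new Configuration();
    ArrayList<Integer> ids = new ArrayList<Integer>();
    ids.add(0);
    ids.add(3);
    // Tell the RCFile machinery which column ids to materialize.
    ColumnProjectionUtils.setReadColumnIDs(conf, ids);

    FileSystem fs = FileSystem.get(conf);
    RCFile.Reader reader = new RCFile.Reader(fs, file, conf);
    LongWritable rowId = new LongWritable();
    BytesRefArrayWritable cols = new BytesRefArrayWritable();
    while (reader.next(rowId)) {
      reader.getCurrentRow(cols); // only projected columns are filled in
      // ... turn the raw bytes in cols.get(0) / cols.get(3) into
      //     Pig Tuple fields here ...
    }
    reader.close();
  }
}
```

A LoadFunc wrapping this would translate Pig's requested fields into the column-id list before opening the reader, which is presumably how the loader avoids reading unused column data.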