Re: Pig Performance Benchmarks

2009-02-17 Thread Alan Gates
That's correct.  The 10m in the names wasn't really meant to be
hardcoded into the patch; the idea is that the tables could be
created at different sizes depending on your cluster size.  Sorry for
the incomplete state of things.  Obviously that patch needs some work
before I commit it.


Alan.

On Feb 13, 2009, at 11:09 PM, Ashutosh Chauhan wrote:


Hi Alan & Others,

I am using the pigmix patch at
https://issues.apache.org/jira/browse/PIG-200 and want to generate
test data and run the pigmix queries on it. As I understand it, the shell
scripts in the patch are intended to generate data for the pigmix queries.
I have been able to adapt the shell scripts, map-reduce jobs and
pigmix queries to our cluster environment. I hit a few problems because
of hard-coded paths, but resolved most of them. One point is still
unclear, though. I believe there is a one-to-one correspondence between
the test data files generated by the shell script and the files loaded by
the pig queries, and I wanted to verify that this is the case. As I
understand it, the correspondence is as follows:

generate_data.sh      pigmix
================      ======
page_views        ->  pages10m
widerow           ->  widerow1m
power_users       ->  power_users, power_users10m (either could be used?)
users             ->  users, users10m (either could be used?)

Is my understanding correct? Since the generated data is random, I could
not verify it manually by checking the schema inside the files.
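
One way to check the correspondence without inspecting the random data is
to list the paths each pigmix script loads and compare them with the file
names the data generator writes. The Python below is a minimal sketch of
that idea, assuming the scripts use single-quoted paths in their load
statements. The helper names (loaded_paths, compare), the sample script
text and the set of generated names are made up for illustration and are
not taken from the PIG-200 patch.

```python
import re

# Matches Pig Latin load statements such as:  A = load 'pages10m' using ...;
# Assumption: load paths are single-quoted string literals.
LOAD_RE = re.compile(r"\bload\s+'([^']+)'", re.IGNORECASE)


def loaded_paths(pig_script):
    """Return the set of paths that a Pig script loads."""
    return set(LOAD_RE.findall(pig_script))


def compare(pig_script, generated):
    """Return (paths loaded but not generated, paths generated but not loaded)."""
    loaded = loaded_paths(pig_script)
    return loaded - generated, generated - loaded


# Hypothetical script text and generated file names, for illustration only.
sample_script = """
A = load 'pages10m' using PigStorage();
B = LOAD 'users10m' using PigStorage();
C = join A by $0, B by $0;
store C into 'out';
"""
generated = {"page_views", "users10m"}

missing, unused = compare(sample_script, generated)
print("loaded but not generated:", sorted(missing))
print("generated but not loaded:", sorted(unused))
```

Any name reported as loaded but not generated points at a mismatch
between the generator's output names and the names hard-coded in the
queries.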

Thanks,
Ashutosh




Re: Pig performance

2008-12-31 Thread Alan Gates
This will definitely be done after the merge of types to trunk.  As  
for PIG-273, the changes we need to make are larger than just that.   
Consider, for example:


A = load ...
B = filter ...
store B into 'bla';
C = group B by $0;
...

There's no split explicitly in there, but pig should be able to tee  
the input at the 'store B' and keep going.  So PIG-273 is part of it,  
but I imagine when we start working on it there'll be another JIRA to  
track all the changes, of which PIG-273 will become a sub-task.
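
The "tee" idea can be sketched outside Pig. The Python below only
illustrates the dataflow. It is not Pig's planner or any Pig API. It
scans the input once, and each record that passes the filter is both
written to the store and fed to the grouping, so the input is not re-read
for the second branch. All names (run_with_tee, keep, key) and the sample
records are hypothetical.

```python
from collections import defaultdict


def run_with_tee(records, keep, key):
    """Single pass: filter, then send each surviving record to both consumers.

    Returns (stored, groups, scans). `stored` plays the role of
    "store B into 'bla'", `groups` the role of "C = group B by $0",
    and `scans` counts how many input records were read.
    """
    stored = []                 # stand-in for the stored output
    groups = defaultdict(list)  # stand-in for the group-by result
    scans = 0
    for rec in records:         # the input is read exactly once
        scans += 1
        if keep(rec):           # B = filter ...
            stored.append(rec)            # tee branch 1: the store
            groups[key(rec)].append(rec)  # tee branch 2: the group
    return stored, dict(groups), scans


# Hypothetical input: (user, clicks) tuples.
data = [("a", 1), ("b", 0), ("a", 3), ("c", 2)]
stored, groups, scans = run_with_tee(
    data, keep=lambda r: r[1] > 0, key=lambda r: r[0]
)
print(stored)
print(groups)
print(scans)
```

Without the tee, the same result needs two passes over the input: one to
produce the stored output and one to recompute the filter for the group.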


Alan.




Re: Pig performance

2008-12-30 Thread Kevin Weil
Hi Olga,

I am eagerly awaiting not having to re-read all data each time I store part
of a split!  As far as timelines go, I imagine this will be a larger fix
that will come in after the merge from types -> trunk?  And is
Pig-273<https://issues.apache.org/jira/browse/PIG-273>the proper bug
for tracking this issue?

Thanks,
Kevin



RE: Pig performance

2008-12-22 Thread Olga Natkovich
The reason trunk does not contain the latest code is that Pig has
undergone a complete redesign that we could not do incrementally on the
trunk without jeopardizing its stability. The decision was made to do
the work on a branch and then merge the branch code to the trunk when it is
stable.

The merge will happen in early January.

The second comment that Alan made is that we are about to start work on
cross query optimization - ability to combine computations across
multiple stores.

Olga 



Re: Pig performance

2008-12-20 Thread Ted Dunning


I think the key points that Alan brought up in his blog comment were  
that trunk pig is paradoxically not the most current and that storing  
intermediate results can decrease the scope of optimizations.




Re: Pig performance

2008-12-20 Thread Alan Gates
I left a comment on the blog addressing some of the issues he brought  
up.


Alan.

On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote:


Hey Pig team,

Did anyone check out the recent claims about Pig's poor performance versus
Cascading? Though I haven't worked extensively with either system, I found
the statements made fairly bold and am curious to hear more about their
validity from the Pig development team:
http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html
.

Thanks,
Jeff