Re: Pig Performance Benchmarks
That's correct. The 10m in the names wasn't really meant to be hardcoded into the patch, as the idea is that the tables could be created at different sizes depending on your cluster size. Sorry for the incomplete state of things; obviously that patch needs some work before I commit it.

Alan.

On Feb 13, 2009, at 11:09 PM, Ashutosh Chauhan wrote:

Hi Alan & Others,

I am using the pigmix patch at https://issues.apache.org/jira/browse/PIG-200 and want to generate test data and run the pigmix queries on it. As I understand it, the shell scripts in the patch are intended to generate data for the pigmix queries. I have been able to adapt the shell scripts, map-reduce jobs and pigmix queries to our cluster environment. I faced a few problems because of hard-coded paths, but resolved most of them.

I still have one point of confusion, though. I believe there is a one-to-one correspondence between the test data files generated by the shell script and the files loaded by the pig queries, and I wanted to verify that this is the case. According to my understanding, the correspondence is as follows:

generate_data.sh -> pigmix
page_views -> pages10m
widerow -> widerow1m
power_users -> power_users, power_users10m (either could be used?)
users -> users, users10m (either could be used?)

Is my understanding correct? Since the generated data is random, I could not verify this manually by checking the schema inside the files.

Thanks,
Ashutosh
Re: Pig performance
This will definitely be done after the merge of types to trunk. As for PIG-273, the changes we need to make are larger than just that. Consider, for example:

    A = load ...
    B = filter ...
    store B into 'bla';
    C = group B by $0;
    ...

There's no split explicitly in there, but Pig should be able to tee the input at the 'store B' and keep going. So PIG-273 is part of it, but I imagine when we start working on it there'll be another JIRA to track all the changes, of which PIG-273 will become a sub-task.

Alan.

On Dec 30, 2008, at 12:48 AM, Kevin Weil wrote:

Hi Olga,

I am eagerly awaiting not having to re-read all data each time I store part of a split! As far as timelines go, I imagine this will be a larger fix that will come in after the merge from types -> trunk? And is PIG-273 <https://issues.apache.org/jira/browse/PIG-273> the proper bug for tracking this issue?

Thanks,
Kevin

On Mon, Dec 22, 2008 at 10:22 AM, Olga Natkovich wrote:

The reason trunk does not contain the latest code is that Pig has undergone a complete redesign that we could not do incrementally on the trunk without jeopardizing its stability. The decision was made to do the work on a branch and then merge branch code to the trunk when it is stable.

The merging will be happening in early January.

The second comment that Alan made is that we are about to start work on cross query optimization - ability to combine computations across multiple stores.

Olga

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Saturday, December 20, 2008 10:33 AM
To: pig-dev@hadoop.apache.org
Cc: pig-dev@hadoop.apache.org
Subject: Re: Pig performance

I think the key points that Alan brought up in his blog comment were that trunk pig is paradoxically not the most current and that storing intermediate results can decrease the scope of optimizations.

On Dec 20, 2008, at 10:16, Alan Gates wrote:

I left a comment on the blog addressing some of the issues he brought up.

Alan.

On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote:

Hey Pig team,

Did anyone check out the recent claims about Pig's poor performance versus Cascading? Though I haven't worked extensively with either system, I found the statements made fairly bold and am curious to hear more about their validity from the Pig development team:

http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html

Thanks,
Jeff
Re: Pig performance
Hi Olga,

I am eagerly awaiting not having to re-read all data each time I store part of a split! As far as timelines go, I imagine this will be a larger fix that will come in after the merge from types -> trunk? And is PIG-273 <https://issues.apache.org/jira/browse/PIG-273> the proper bug for tracking this issue?

Thanks,
Kevin

On Mon, Dec 22, 2008 at 10:22 AM, Olga Natkovich wrote:

> The reason trunk does not contain the latest code is that Pig has
> undergone a complete redesign that we could not do incrementally on the
> trunk without jeopardizing its stability. The decision was made to do
> the work on a branch and then merge branch code to the trunk when it is
> stable.
>
> The merging will be happening in early January.
>
> The second comment that Alan made is that we are about to start work on
> cross query optimization - ability to combine computations across
> multiple stores.
>
> Olga
>
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > Sent: Saturday, December 20, 2008 10:33 AM
> > To: pig-dev@hadoop.apache.org
> > Cc: pig-dev@hadoop.apache.org
> > Subject: Re: Pig performance
> >
> > I think the key points that Alan brought up in his blog
> > comment were that trunk pig is paradoxically not the most
> > current and that storing intermediate results can decrease
> > the scope of optimizations.
> >
> > On Dec 20, 2008, at 10:16, Alan Gates wrote:
> >
> > > I left a comment on the blog addressing some of the issues he
> > > brought up.
> > >
> > > Alan.
> > >
> > > On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote:
> > >
> > >> Hey Pig team,
> > >>
> > >> Did anyone check out the recent claims about Pig's poor performance
> > >> versus Cascading? Though I haven't worked extensively with either
> > >> system, I found the statements made fairly bold and am curious to
> > >> hear more about their validity from the Pig development team:
> > >>
> > >> http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html
> > >>
> > >> Thanks,
> > >> Jeff
RE: Pig performance
The reason trunk does not contain the latest code is that Pig has undergone a complete redesign that we could not do incrementally on the trunk without jeopardizing its stability. The decision was made to do the work on a branch and then merge branch code to the trunk when it is stable.

The merging will be happening in early January.

The second comment that Alan made is that we are about to start work on cross query optimization - ability to combine computations across multiple stores.

Olga

> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Saturday, December 20, 2008 10:33 AM
> To: pig-dev@hadoop.apache.org
> Cc: pig-dev@hadoop.apache.org
> Subject: Re: Pig performance
>
> I think the key points that Alan brought up in his blog
> comment were that trunk pig is paradoxically not the most
> current and that storing intermediate results can decrease
> the scope of optimizations.
>
> On Dec 20, 2008, at 10:16, Alan Gates wrote:
>
> > I left a comment on the blog addressing some of the issues he
> > brought up.
> >
> > Alan.
> >
> > On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote:
> >
> >> Hey Pig team,
> >>
> >> Did anyone check out the recent claims about Pig's poor performance
> >> versus Cascading? Though I haven't worked extensively with either
> >> system, I found the statements made fairly bold and am curious to
> >> hear more about their validity from the Pig development team:
> >>
> >> http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html
> >>
> >> Thanks,
> >> Jeff
Re: Pig performance
I think the key points that Alan brought up in his blog comment were that trunk pig is paradoxically not the most current and that storing intermediate results can decrease the scope of optimizations.

On Dec 20, 2008, at 10:16, Alan Gates wrote:

I left a comment on the blog addressing some of the issues he brought up.

Alan.

On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote:

Hey Pig team,

Did anyone check out the recent claims about Pig's poor performance versus Cascading? Though I haven't worked extensively with either system, I found the statements made fairly bold and am curious to hear more about their validity from the Pig development team:

http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html

Thanks,
Jeff
Re: Pig performance
I left a comment on the blog addressing some of the issues he brought up.

Alan.

On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote:

Hey Pig team,

Did anyone check out the recent claims about Pig's poor performance versus Cascading? Though I haven't worked extensively with either system, I found the statements made fairly bold and am curious to hear more about their validity from the Pig development team:

http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html

Thanks,
Jeff