[jira] Commented: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901279#action_12901279 ]

Daniel Dai commented on PIG-1514:
---------------------------------

One minor correction, adding:
{code}
currentPlan.remove(limit);
{code}
to OptimizeLimit:197

Migrate logical optimization rule: OpLimitOptimizer
---------------------------------------------------
                Key: PIG-1514
                URL: https://issues.apache.org/jira/browse/PIG-1514
            Project: Pig
         Issue Type: Sub-task
         Components: impl
   Affects Versions: 0.7.0
           Reporter: Daniel Dai
           Assignee: Xuefu Zhang
            Fix For: 0.8.0
        Attachments: jira-1514-0.patch

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
Re: [VOTE] Pig to become a top level Apache project
With 9 +1 votes and no -1s the vote passes. I will begin a vote on Hadoop general. Alan. On Aug 18, 2010, at 10:34 AM, Alan Gates wrote: Earlier this week I began a discussion on Pig becoming a TLP (http://bit.ly/byD7L8 ). All of the received feedback was positive. So, let's have a formal vote. I propose we move Pig to a top level Apache project. I propose that the initial PMC of this project be the list of all currently active Pig committers (http://hadoop.apache.org/pig/whoweare.html ) as of 18 August 2010. I nominate Olga Natkovich as the chair of the PMC. (PMC chairs have no more power than other PMC members, but they are responsible for writing regular reports for the Apache board, assigning rights to new committers, etc.) I propose that as part of the resolution that will be forwarded to the Apache board we include that one of the first tasks of the new Pig PMC will be to adopt bylaws for the governance of the project. Alan. P.S. If this vote passes, the next step is that the proposal will be forwarded to the Hadoop PMC for discussion and vote. If the Hadoop PMC vote passes, a formal resolution is then drafted (see http://bit.ly/bvOTRq for an example resolution) and sent to the Apache board. The Apache board will then vote on whether to make Pig a TLP.
Re: August Pig contributor workshop
Olga, We do have another couple of spots. -Dmitriy On Thu, Aug 19, 2010 at 10:28 AM, Olga Natkovich ol...@yahoo-inc.comwrote: Dmitry, Do you have any spots left? Olga -Original Message- From: Russell Jurney [mailto:russell.jur...@gmail.com] Sent: Thursday, August 19, 2010 5:22 AM To: pig-dev@hadoop.apache.org Subject: Re: August Pig contributor workshop Oh, +2 more - Pete Skomoroch and Sam Shah will also attend, for a total of 4 LinkedIners. On Wed, Aug 18, 2010 at 9:18 PM, Alan Gates ga...@yahoo-inc.com wrote: Confirming Olga and I will be there. Alan. On Aug 18, 2010, at 4:45 PM, Dmitriy Ryaboy wrote: Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy On Tue, Aug 17, 2010 at 4:04 PM, Alan Gates ga...@yahoo-inc.com wrote: All, We will be holding the next Pig contributor workshop at Twitter on Wednesday, August 25 from 4-6. The tentative agenda is to discuss: Making Piggybank better Pig and Azkaban integration Plans for features in 0.9 An update on the Howl project Anyone contributing to or interested i
[jira] Created: (PIG-1556) Need a clean way to kill Pig jobs.
Need a clean way to kill Pig jobs.
----------------------------------
                Key: PIG-1556
                URL: https://issues.apache.org/jira/browse/PIG-1556
            Project: Pig
         Issue Type: New Feature
         Components: tools
   Affects Versions: 0.7.0
           Reporter: Aravind Srinivasan
            Fix For: 0.9.0

We need a way to kill a running Pig script cleanly. This is very similar to the hadoop job -kill command. This requirement means the following:
1) Support a "pig -kill scriptID" or similar syntax. The script ID or some unique handle should be easily available for the user to identify a running Pig job.
2) The command will then identify all the MR jobs that are currently spawned by the given Pig script.
3) It will internally use hadoop job -kill to kill each one of those spawned MR jobs.
4) It will do any other necessary cleanup and also make sure all mappers/reducers emanating from this Pig script are killed.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1178:
----------------------------
    Attachment: PIG-1178-7.patch

PIG-1178-7.patch switches the flag to use the new logical plan by default. It fixes most unit tests except:
1. TestMultiQuery.testMultiQueryJiraPig1169, which depends on PIG-1514 and will be fixed automatically once PIG-1514 is checked in
2. TestPruneColumn.testMapKey3
Both test cases are temporarily commented out. All other unit tests pass.

Here is the test-patch result:
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 36 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

LogicalPlan and Optimizer are too complex and hard to work with
---------------------------------------------------------------
                Key: PIG-1178
                URL: https://issues.apache.org/jira/browse/PIG-1178
            Project: Pig
         Issue Type: Improvement
           Reporter: Alan Gates
           Assignee: Daniel Dai
            Fix For: 0.8.0
        Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch

The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix.
Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Status: Open (was: Patch Available) LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Status: Patch Available (was: Open) LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901543#action_12901543 ] Daniel Dai commented on PIG-1178: - PIG-1178-7.patch committed. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-506:
------------------------------
    Attachment: PIG-506.patch

The new patch addresses my comments. test-patch results:
[exec] -1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 10 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] -1 release audit. The applied patch generated 433 release audit warnings (more than the trunk's current 425 warnings).
The release audit warnings are for the javadoc html files. I will commit once all unit tests pass.

Does pig need a NATIVE keyword?
-------------------------------
                Key: PIG-506
                URL: https://issues.apache.org/jira/browse/PIG-506
            Project: Pig
         Issue Type: New Feature
         Components: impl
           Reporter: Alan Gates
           Assignee: Aniket Mokashi
           Priority: Minor
            Fix For: 0.8.0
        Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.patch, TestWordCount.jar

Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce). The semantics of NATIVE would vary by underlying system.
In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:
{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(B);
D = native (jar=mymr.jar, infile=frompig outfile=topig);
E = join D by $0, X by $0;
...
{code}
This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, the native block invoked, and then data read back from disk. Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not worry about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901556#action_12901556 ] Alan Gates commented on PIG-1555: - +1 If you have a chance sometime I'd be curious to learn the performance characteristics of this versus PigStorage. I'm curious if there is substantial cost to dealing with escaping. [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901559#action_12901559 ] Alan Gates commented on PIG-1508: - Alright, I'll get this checked in before we branch for 0.8 then. Make 'docs' target (forrest) work with Java 1.6 --- Key: PIG-1508 URL: https://issues.apache.org/jira/browse/PIG-1508 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Carl Steinbach Assignee: Carl Steinbach Attachments: PIG-1508.patch.txt FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6 The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-908) Need a way to correlate MR jobs with Pig statements
[ https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-908:
-------------------------------

With Pig 0.8.0 we print a summary of the execution that contains (among other things) how aliases mapped to jobs. Example:

JobId                   Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature            Outputs
job_201004271216_12712  1     1        3           3           3           12             12             12             B,C    GROUP_BY,COMBINER
job_201004271216_12713  1     1        3           3           3           12             12             12             D      SAMPLER
job_201004271216_12714  1     1        3           3           3           12             12             12             D      ORDER_BY,COMBINER  hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp743703298/tmp-2019944040,

Need a way to correlate MR jobs with Pig statements
---------------------------------------------------
                Key: PIG-908
                URL: https://issues.apache.org/jira/browse/PIG-908
            Project: Pig
         Issue Type: Wish
           Reporter: Dmitriy V. Ryaboy
           Assignee: Richard Ding
            Fix For: 0.8.0

Complex Pig Scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script it was responsible for. Explain plans help, but even with the explain plan, a fair amount of effort (and sometimes, experimentation) is required to correlate the failing MR job with the corresponding PigLatin statements. This ticket is created to discuss approaches to alleviating this problem.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1488) Make HDFS temp dir configurable
[ https://issues.apache.org/jira/browse/PIG-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1488:
--------------------------------
    Release Note: Pig stores intermediate data generated between MR jobs in a temp location on HDFS. In Pig 0.8.0 this location is configurable by using the pig.temp.dir property. The default is /tmp, which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.

Make HDFS temp dir configurable
-------------------------------
                Key: PIG-1488
                URL: https://issues.apache.org/jira/browse/PIG-1488
            Project: Pig
         Issue Type: Improvement
           Reporter: Olga Natkovich
            Fix For: 0.8.0

Currently it is hardcoded to /tmp. It should be made into a property.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
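A minimal sketch of how the new property could be set, assuming a pig.properties-style configuration file; the property name comes from the release note above, but the directory path is made up for illustration:

```properties
# Hypothetical entry in conf/pig.properties (Pig 0.8.0+):
# keep intermediate data between MR jobs under a user-owned HDFS
# directory instead of the default /tmp
pig.temp.dir=/user/alice/pig_tmp
```

Since 0.8 also accepts generic -Dkey=value parameters on the command line, passing -Dpig.temp.dir=/user/alice/pig_tmp when launching Pig should have the same effect.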
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Release Note: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. (was: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. Also added a -R parameter which allows users to specify properties in key=value form on the command line.) Remove -R option. In 0.8 Pig supports generic parameters such as -Dkey=value. support jars and scripts in dfs --- Key: PIG-1505 URL: https://issues.apache.org/jira/browse/PIG-1505 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Fix For: 0.8.0 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, pig-jars-and-scripts-from-dfs-trunk-1.patch, pig-jars-and-scripts-from-dfs-trunk-2.patch, pig-jars-and-scripts-from-dfs-trunk.patch Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
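A sketch of what this change enables, with made-up paths and UDF names; the only claim taken from the issue is that registered jars may now live on HDFS, Amazon S3, or other distributed file systems:

```pig
-- Hypothetical: register a UDF jar directly from HDFS (PIG-1505, Pig 0.8.0+)
register hdfs://namenode:9000/libs/myudfs.jar;

-- a jar stored on Amazon S3 can be registered the same way
register s3://mybucket/libs/more-udfs.jar;

-- myudfs.Cleanup is a made-up UDF used only to show the registered jar in use
a = load '/data/input' as (line:chararray);
b = foreach a generate myudfs.Cleanup(line);
```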
[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1484:
--------------------------------
    Release Note: In Pig 0.7.0 only a single location is supported as input to BinStorage. (This location can be a file, a directory or a glob.) With Pig 0.8.0 we are making BinStorage (similar to PigStorage) support a list of locations. Example:
a = load '1.bin,2.bin' using BinStorage();

BinStorage should support comma separated path
----------------------------------------------
                Key: PIG-1484
                URL: https://issues.apache.org/jira/browse/PIG-1484
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: 0.7.0
           Reporter: Daniel Dai
           Assignee: Daniel Dai
            Fix For: 0.7.0, 0.8.0
        Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch

BinStorage does not take a comma separated path. The following script fails:
a = load '1.bin,2.bin' using BinStorage();
dump a;

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag
[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1447: --- Status: Patch Available (was: Open) Patch for increasing default value to 20%. No new test cases as this only changes the memory limit default. All core tests pass. Result of test-patch - [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Tune memory usage of InternalCachedBag -- Key: PIG-1447 URL: https://issues.apache.org/jira/browse/PIG-1447 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch We need to find a better value for pig.cachedbag.memusage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1557) couple of issues mapping aliases to jobs
couple of issues mapping aliases to jobs
----------------------------------------
                Key: PIG-1557
                URL: https://issues.apache.org/jira/browse/PIG-1557
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.8.0
           Reporter: Olga Natkovich
           Assignee: Richard Ding

I have a simple script:
A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
B = group A by name;
C = foreach B generate group, COUNT(A);
D = order C by $1;
E = limit D 10;
dump E;
I noticed a couple of issues with the alias-to-job mapping: neither load (A) nor limit (E) shows up in the output.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
RE: August Pig contributor workshop
Ok, thanks Dmitry we have at least one more person coming with us. Olga -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Monday, August 23, 2010 10:02 AM To: pig-dev@hadoop.apache.org Subject: Re: August Pig contributor workshop Olga, We do have another couple of spots. -Dmitriy On Thu, Aug 19, 2010 at 10:28 AM, Olga Natkovich ol...@yahoo-inc.comwrote: Dmitry, Do you have any spots left? Olga -Original Message- From: Russell Jurney [mailto:russell.jur...@gmail.com] Sent: Thursday, August 19, 2010 5:22 AM To: pig-dev@hadoop.apache.org Subject: Re: August Pig contributor workshop Oh, +2 more - Pete Skomoroch and Sam Shah will also attend, for a total of 4 LinkedIners. On Wed, Aug 18, 2010 at 9:18 PM, Alan Gates ga...@yahoo-inc.com wrote: Confirming Olga and I will be there. Alan. On Aug 18, 2010, at 4:45 PM, Dmitriy Ryaboy wrote: Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy On Tue, Aug 17, 2010 at 4:04 PM, Alan Gates ga...@yahoo-inc.com wrote: All, We will be holding the next Pig contributor workshop at Twitter on Wednesday, August 25 from 4-6. The tentative agenda is to discuss: Making Piggybank better Pig and Azkaban integration Plans for features in 0.9 An update on the Howl project Anyone contributing to or interested i
[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag
[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901576#action_12901576 ] Olga Natkovich commented on PIG-1447: - This is probably the smallest patch I have reviewed recently :). +1 Tune memory usage of InternalCachedBag -- Key: PIG-1447 URL: https://issues.apache.org/jira/browse/PIG-1447 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch We need to find a better value for pig.cachedbag.memusage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901577#action_12901577 ] Olga Natkovich commented on PIG-1354: - Dmitry, Could you add release notes on how to use this? UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag
[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1447: --- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk. Tune memory usage of InternalCachedBag -- Key: PIG-1447 URL: https://issues.apache.org/jira/browse/PIG-1447 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch We need to find a better value for pig.cachedbag.memusage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901584#action_12901584 ] Dmitriy V. Ryaboy commented on PIG-1354: Olga, There is a follow-up ticket here: https://issues.apache.org/jira/browse/PIG-1551 If that gets committed, I have a pretty detailed explanation of how to use the stuff in http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/ (happy to put the link in release notes, or just paste the whole post). UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901585#action_12901585 ] Olga Natkovich commented on PIG-1354: - Sounds good, Dmitry. Richard will review and commit the patch and then please paste the release notes. UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
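A sketch of how the dynamic invokers discussed in this thread are meant to be used, following the pattern described in the linked blog post; treat the InvokeForString class name and its argument convention as illustrative of the feature rather than authoritative:

```pig
-- Bind a static Java method (java.net.URLDecoder.decode) as a Pig eval
-- function without writing a wrapper UDF. "ForString" names the method's
-- return type; the second argument lists the Java parameter types.
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');

encoded = load 'urls.txt' as (url:chararray);
decoded = foreach encoded generate UrlDecode(url, 'UTF-8');
```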
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901586#action_12901586 ]

Alan Gates commented on PIG-1508:
---------------------------------

I can't figure out a way to test test-patch.sh without checking it in. And, if this does break something it will make life hard for developers who are trying to get their patches in before the 0.8 branch is cut. So, I propose that I hold off checking this in until we have all other pre-0.8 patches checked in. Then I'll check it in and do extensive testing with test-patch. That way I can quickly fix any issues I find and not disrupt others. Then we can branch for 0.8. Seem reasonable?

As a side note, we still need Java 1.5 for forrest in the site docs. This patch only claims to fix it for the docs target, which it does. I'll open a separate JIRA to fix it on the site side, as it would be really nice to not force people to have 2 versions of Java to build Pig stuff.

Make 'docs' target (forrest) work with Java 1.6
-----------------------------------------------
                Key: PIG-1508
                URL: https://issues.apache.org/jira/browse/PIG-1508
            Project: Pig
         Issue Type: Bug
         Components: documentation
   Affects Versions: 0.7.0
           Reporter: Carl Steinbach
           Assignee: Carl Steinbach
        Attachments: PIG-1508.patch.txt

FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6. The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1311) Pig interfaces should be clearly classified in terms of scope and stability
[ https://issues.apache.org/jira/browse/PIG-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901587#action_12901587 ] Olga Natkovich commented on PIG-1311: - +1, please, commit Pig interfaces should be clearly classified in terms of scope and stability --- Key: PIG-1311 URL: https://issues.apache.org/jira/browse/PIG-1311 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.8.0 Attachments: PIG-1311.patch Clearly marking Pig interfaces (Java interfaces but also things like config files, CLIs, Pig Latin syntax and semantics, etc.) to show scope (public/private) and stability (stable/evolving/unstable) will help users understand how to interact with Pig and developers to understand what things they can and cannot change. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1558) build.xml for site directory does not work
build.xml for site directory does not work -- Key: PIG-1558 URL: https://issues.apache.org/jira/browse/PIG-1558 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Going to the site directory and running ant produces: {code} ant Buildfile: build.xml clean: [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build update: BUILD FAILED /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory {code} Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1552) Nested describe failed when the alias is not referred in the first foreach inner plan
[ https://issues.apache.org/jira/browse/PIG-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901593#action_12901593 ] Daniel Dai commented on PIG-1552: - Unit tests pass. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Nested describe failed when the alias is not referred in the first foreach inner plan - Key: PIG-1552 URL: https://issues.apache.org/jira/browse/PIG-1552 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1552-1.patch The following script fails: {code} A = load 'studentab10k' as (name, age, gpa); B = group A by name; C = foreach B { D = distinct A.age; generate group, COUNT(D); } describe C::D; {code} If we remove group from the generate statement, then it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1552) Nested describe failed when the alias is not referred in the first foreach inner plan
[ https://issues.apache.org/jira/browse/PIG-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1552: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed. Nested describe failed when the alias is not referred in the first foreach inner plan - Key: PIG-1552 URL: https://issues.apache.org/jira/browse/PIG-1552 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1552-1.patch The following script fails: {code} A = load 'studentab10k' as (name, age, gpa); B = group A by name; C = foreach B { D = distinct A.age; generate group, COUNT(D); } describe C::D; {code} If we remove group from the generate statement, then it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901600#action_12901600 ] Richard Ding commented on PIG-1518: --- +1. The patch looks good. A few minor points: * In PigSplit, the method add(InputSplit split) is not used and can be removed * In MapRedUtil, it would be better to not leave the debug verification code in the source code * In PigRecordReader, the code can be simplified if the initNextRecordReader() call is moved from the constructor to the initialize() method multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
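The core of the PIG-1518 idea — packing many small input files into fewer splits so fewer map tasks are launched — can be sketched without any Hadoop dependency. This is a simplified, hypothetical model (greedy packing by size), not the actual PigSplit/CombineFileInputFormat code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPacker {
    // Greedily pack file sizes (in bytes) into splits no larger than
    // maxSplitBytes, so many small files share one map task.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Start a new split if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // Five small files become two splits instead of five map tasks.
        long[] sizes = {10, 20, 30, 90, 5};
        System.out.println(pack(sizes, 100).size()); // 2
    }
}
```

A real implementation would additionally consider data locality (grouping files by host/rack), which is what makes the Hadoop combined-input formats more involved.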
[jira] Updated: (PIG-1558) build.xml for site directory does not work
[ https://issues.apache.org/jira/browse/PIG-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1558: Attachment: PIG-1558.patch Attached patch makes it so that the ant invocation requires the user to specify the location of forrest. Also, the validation phase of forrest is disabled so that Java 1.6 can be used. Removal of the validation phase does not seem to impact creation of the web pages. build.xml for site directory does not work -- Key: PIG-1558 URL: https://issues.apache.org/jira/browse/PIG-1558 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1558.patch Going to the site directory and running ant produces: {code} ant Buildfile: build.xml clean: [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build update: BUILD FAILED /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory {code} Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1559) Several things stated in Pig philosophy page are out of date
Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1558) build.xml for site directory does not work
[ https://issues.apache.org/jira/browse/PIG-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901612#action_12901612 ] Olga Natkovich commented on PIG-1558: - +1 build.xml for site directory does not work -- Key: PIG-1558 URL: https://issues.apache.org/jira/browse/PIG-1558 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1558.patch Going to the site directory and running ant produces: {code} ant Buildfile: build.xml clean: [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build update: BUILD FAILED /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory {code} Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901617#action_12901617 ] Alan Gates commented on PIG-1508: - I'm guessing the contrib failures are just because Hudson isn't working properly. I run contrib tests only with 1.6 all the time and don't see issues. The site issues I'm talking about are under pig/site (not pig/trunk). I've already posted another patch (see PIG-1558) to deal with it. Make 'docs' target (forrest) work with Java 1.6 --- Key: PIG-1508 URL: https://issues.apache.org/jira/browse/PIG-1508 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Carl Steinbach Assignee: Carl Steinbach Attachments: PIG-1508.patch.txt FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6 The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Pig optimizer
Hey everyone, I was wondering if anybody has any references or suggestions on how to learn about Pig's optimizer besides the source code or Pig's paper. Thanks in advance. Renato M.
[jira] Updated: (PIG-1510) Add `deepCopy` for LogicalExpressions
[ https://issues.apache.org/jira/browse/PIG-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1510: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk. Thanks Swati for contributing! Add `deepCopy` for LogicalExpressions - Key: PIG-1510 URL: https://issues.apache.org/jira/browse/PIG-1510 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Fix For: 0.8.0 Attachments: deepCopy.patch, deepCopy.patch It would be useful to have a way to `deepCopy` an expression. `deepCopy` will create a new object so that changes made to one object will not reflect in the copy. There are 2 reasons why we don't override clone. * It may be better to use `deepCopy` since the copy semantics are explicit (since deepCopy may be expensive). * A second important reason for defining `deepCopy` as a separate routine is that it can be passed a plan as an argument which will be updated as the expression is copied (through plan.add and plan.connect). The usage would look like the following: {noformat} LogicalExpressionPlan logicalPlan = new LogicalExpressionPlan(); LogicalExpression copyExpression = origExpression.deepCopy( logicalPlan ); {noformat} An immediate motivation for this would be for constructing the expressions that constitute the CNF form of an expression. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
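The deepCopy contract described above — a copy whose mutations never leak back to the original, registered into a destination plan as it is built — can be illustrated with a tiny stand-in expression tree. The classes below are hypothetical, not Pig's actual LogicalExpression API:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal expression node: an operator name plus child expressions.
class Expr {
    final String op;
    final List<Expr> children = new ArrayList<>();
    Expr(String op) { this.op = op; }

    // Recursively copy this subtree, adding each new node to the
    // destination plan (analogous to plan.add / plan.connect).
    Expr deepCopy(List<Expr> plan) {
        Expr copy = new Expr(op);
        plan.add(copy);
        for (Expr c : children) {
            copy.children.add(c.deepCopy(plan));
        }
        return copy;
    }
}

public class DeepCopyDemo {
    public static void main(String[] args) {
        Expr add = new Expr("add");
        add.children.add(new Expr("x"));
        add.children.add(new Expr("const1"));

        List<Expr> newPlan = new ArrayList<>();
        Expr copy = add.deepCopy(newPlan);

        // Mutating the copy leaves the original untouched.
        copy.children.get(0).children.add(new Expr("mutated"));
        System.out.println(newPlan.size() + " "
                + add.children.get(0).children.size()); // 3 0
    }
}
```

This also shows why a plan argument is more useful than Object.clone(): the destination plan is populated as a side effect of the copy, exactly what CNF construction needs.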
[jira] Updated: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1559: Attachment: PIG-1559.patch Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1559.patch The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1559: Status: Patch Available (was: Open) Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1559.patch The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pig optimizer
Hi, Renato, There is a description of the optimization rules in the Pig Latin reference manual: http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html#Optimization+Rules. Is that enough? Daniel Renato Marroquín Mogrovejo wrote: Hey everyone, I was wondering if anybody has any references or suggestions on how to learn about Pig's optimizer besides the source code or Pig's paper. Thanks in advance. Renato M.
is Hudson awol?
Haven't heard anything from Hudson in a while... -D
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch The add method of PigSplit is removed. The debug code is left to facilitate future debugging work. The use of initNextRecordReader is pretty much cloned from org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader and I'll leave it as is too. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901656#action_12901656 ] Richard Ding commented on PIG-1551: --- In Invoker.java, there is a typo: {code} private static final Class<?> LONG_ARRAY_CLASS = new String[0].getClass(); {code} Also, in the unPrimitivize method, this code seems unnecessary: {code} } else if (klass.equals(DOUBLE_ARRAY_CLASS)) { return DOUBLE_ARRAY_CLASS; {code} Otherwise the patch looks good. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
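The typo flagged above is a class-literal constant built from the wrong array type. A corrected version would presumably build the constant from a Long array; the snippet below is a hypothetical sketch of the fix, not the committed code:

```java
public class ArrayClassConstants {
    // Corrected: the constant for Long[] must come from a Long array,
    // not from new String[0].getClass() as in the flagged typo.
    private static final Class<?> LONG_ARRAY_CLASS = new Long[0].getClass();

    public static void main(String[] args) {
        // The component type confirms which array class the constant holds.
        System.out.println(LONG_ARRAY_CLASS.getComponentType().getSimpleName()); // Long
    }
}
```

Such typos are easy to miss precisely because every `T[0].getClass()` expression type-checks against `Class<?>`; a unit test asserting the component type catches them.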
Re: split operator
Hi Daniel, This is a question from long ago, but I suddenly came up with some more thoughts on this. In a query as simple as this: A = LOAD 'input'; B = FILTER A BY $1 == 1; C = COGROUP A BY $0, B BY $0; the optimizer will insert a split operator to reuse A. According to the source code, a map-reduce job will be ended when it sees split and output the result to A1 and A2 which will be used by two subsequent jobs to process B and C. In this case, the first job does nothing meaningful but copy the source 'input' twice. Is there some optimization applied here (like the MultiQueryOptimizer you mentioned previously)? How? Since I didn't take a look at the MultiQueryOptimizer, it would be a great help if you can briefly describe how MultiQueryOptimizer works. Thanks a lot. -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 4:58:49 PM Subject: Re: split operator Hi, Gang, It is about multiquery optimization. In MRCompiler, we will create a map-reduce boundary for split; later, in MultiQueryOptimizer, we will merge several splits into one map-reduce job. In this map-reduce job, we will nest several split plans. Daniel Gang Luo wrote: Hi Daniel, in 4.3.1, the example and figure 6 show this. 5.1 last paragraph says the split operator maintains a one-tuple buffer for each branch and talks about how to synchronize multiple branches. I do think that is the in-memory split. Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 2:09:25 PM Subject: Re: split operator Hi, Gang, Which part of the paper are you talking about? We don't do in-memory split. We dump the split result to a temporary file and start a new map-reduce job. 
Split does create a map-reduce boundary (though it is not entirely true; the multiquery optimizer may combine some of these jobs) Daniel Gang Luo wrote: Hi all, according to the VLDB 09 paper, the split operator and all its successive operators reside in memory without any blocking in between. However, the source code (version 0.7) shows that a MR job is actually ended when it meets the split operator and multiple new MR jobs are created, each representing one branch. This write-once-read-multiple-times method is different from the in-memory method mentioned in that paper. Does Pig change the strategy for split, or is there still an in-memory version of split I didn't discover? Thanks, -Gang
[jira] Created: (PIG-1560) Build target 'checkstyle' fails
Build target 'checkstyle' fails --- Key: PIG-1560 URL: https://issues.apache.org/jira/browse/PIG-1560 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Assignee: Giridharan Kesavan Fix For: 0.8.0 Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1560) Build target 'checkstyle' fails
[ https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1560: -- Description: Stack trace: {code} /trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at 
org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} was: Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at 
org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} Build target 'checkstyle' fails --- Key: PIG-1560 URL:
Re: split operator
Hi, Gang, Yes, that's what MultiQueryOptimizer addresses. After splitting, we split the script into smaller combinable pieces, and MultiQueryOptimizer will combine as many splitters and splittees as possible into the same map-reduce job. So after SplitInserter, you might see more jobs, but you will end up with fewer jobs. The algorithm for MultiQueryOptimizer is: for every splitter, find as many combinable splittees as possible, and combine them into the same mapreduce job. You can find more details at http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification Daniel Gang Luo wrote: Hi Daniel, This is a question from long ago, but I suddenly came up with some more thoughts on this. In a query as simple as this: A = LOAD 'input'; B = FILTER A BY $1 == 1; C = COGROUP A BY $0, B BY $0; the optimizer will insert a split operator to reuse A. According to the source code, a map-reduce job will be ended when it sees split and output the result to A1 and A2 which will be used by two subsequent jobs to process B and C. In this case, the first job does nothing meaningful but copy the source 'input' twice. Is there some optimization applied here (like the MultiQueryOptimizer you mentioned previously)? How? Since I didn't take a look at the MultiQueryOptimizer, it would be a great help if you can briefly describe how MultiQueryOptimizer works. Thanks a lot. -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 4:58:49 PM Subject: Re: split operator Hi, Gang, It is about multiquery optimization. In MRCompiler, we will create a map-reduce boundary for split; later, in MultiQueryOptimizer, we will merge several splits into one map-reduce job. In this map-reduce job, we will nest several split plans. Daniel Gang Luo wrote: Hi Daniel, in 4.3.1, the example and figure 6 show this. 5.1 last paragraph says the split operator maintains a one-tuple buffer for each branch and talks about how to synchronize multiple branches. 
I do think that is the in-memory split. Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 2:09:25 PM Subject: Re: split operator Hi, Gang, Which part of the paper are you talking about? We don't do in-memory split. We dump the split result to a temporary file and start a new map-reduce job. Split does create a map-reduce boundary (though it is not entirely true; the multiquery optimizer may combine some of these jobs) Daniel Gang Luo wrote: Hi all, according to the VLDB 09 paper, the split operator and all its successive operators reside in memory without any blocking in between. However, the source code (version 0.7) shows that a MR job is actually ended when it meets the split operator and multiple new MR jobs are created, each representing one branch. This write-once-read-multiple-times method is different from the in-memory method mentioned in that paper. Does Pig change the strategy for split, or is there still an in-memory version of split I didn't discover? Thanks, -Gang
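The job-count effect of the merge described in this thread can be modeled in a few lines. This is a hypothetical simplified model, not the MultiQueryOptimizer code: each splitter and each splittee would naively be its own map-reduce job, while after merging, a splitter and all its combinable splittees share one job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiQueryJobCount {
    // Naive plan: one job for the splitter plus one per splittee branch.
    static int naiveJobs(Map<String, List<String>> splitterToSplittees) {
        int jobs = 0;
        for (List<String> splittees : splitterToSplittees.values()) {
            jobs += 1 + splittees.size();
        }
        return jobs;
    }

    // Merged plan: splitter and its combinable splittees nest in one job.
    static int mergedJobs(Map<String, List<String>> splitterToSplittees) {
        return splitterToSplittees.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        // A (the load) is the splitter; the filter branch (B) and the
        // cogroup branch (C) are its splittees, as in the query above.
        plan.put("A", Arrays.asList("B", "C"));
        System.out.println(naiveJobs(plan) + " -> " + mergedJobs(plan)); // 3 -> 1
    }
}
```

In the real optimizer, "combinable" is constrained (e.g. by what can legally nest in one map or reduce plan), so not every splittee folds into its splitter's job.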
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch Fix a typo; rebase on the latest trunk. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1515) Migrate logical optimization rule: PushDownForeachFlatten
[ https://issues.apache.org/jira/browse/PIG-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1515: - Status: Patch Available (was: Open) Migrate logical optimization rule: PushDownForeachFlatten - Key: PIG-1515 URL: https://issues.apache.org/jira/browse/PIG-1515 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1515-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Attachment: PIG-1557.patch The alias for the load statement is missing. Add the load alias to the alias list. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Fix Version/s: 0.8.0 couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Status: Patch Available (was: Open) Release Note: Feature: combine splits of sizes smaller than the value of property pig.maxCombinedSplitSize or, if the property pig.maxCombinedSplitSize is not set, the file system default block size of the load's location. This feature can be turned off by setting the property pig.noSplitCombination to true. When such a combination is performed, a log message like Total input paths (combined) to process : 7 will be logged. This feature is applicable when a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more under-fed mappers to be launched and a potential slowdown of the execution. This change will not cause any backward compatibility issue, except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method, in which case a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object. In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible.
There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
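Based on the release note above, using the feature from a script would look like the following sketch. The property names come from the note; the size value and load path are illustrative only.

```pig
-- Combine input splits smaller than 128 MB (value in bytes).
set pig.maxCombinedSplitSize 134217728;
-- Or opt out of split combination entirely:
-- set pig.noSplitCombination true;
A = load '/data/many_small_files' using PigStorage();
```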
[jira] Created: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
XMLLoader in Piggybank does not support bz2 or gzip compressed XML files Key: PIG-1561 URL: https://issues.apache.org/jira/browse/PIG-1561 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat I have a simple Pig script which uses the XMLLoader after Piggybank is built. {code} register piggybank.jar; A = load '/user/viraj/capacity-scheduler.xml.gz' using org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray); B = limit A 1; dump B; --store B into '/user/viraj/handlegz' using PigStorage(); {code} This returns an empty tuple: {code} () {code} If you supply the uncompressed XML file, you get {code} (<property><name>mapred.capacity-scheduler.queue.my.capacity</name><value>10</value><description>Percentage of the number of slots in the cluster that are guaranteed to be available for jobs in this queue.</description></property>) {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901697#action_12901697 ] Dmitriy V. Ryaboy commented on PIG-1555: Alan, The differences I observe when running on actual csv files are within the margin of error -- sometimes CSVLoader comes out on top. Then again I am reading actual CSVs with quoted commas, so it's possible that the similarity in runtimes is due to the fact that PigStorage sees the commas and allocates extra tuple fields. -D [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
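The quoted-comma case Dmitriy mentions is the classic CSV edge case that a plain delimiter split gets wrong. A minimal illustration in Python (not Pig code; just showing why a CSV-aware parser differs from a naive split):

```python
import csv
import io

line = 'name,"Doe, Jane",42\n'

# A naive split on commas breaks the quoted field into two pieces.
naive = line.strip().split(',')

# A CSV-aware parser keeps "Doe, Jane" as a single field.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['name', '"Doe', ' Jane"', '42']
print(parsed)  # ['name', 'Doe, Jane', '42']
```

This is the behavior difference that makes a dedicated CSVLoader worthwhile even when PigStorage on ',' appears to work on simple inputs.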