[jira] Updated: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1334:

Attachment: mvn_pig_2.patch

Based on the feedback, I am renaming the pig jars back to the old names. I had changed the names to make them compatible with the maven naming standard. I am also publishing pig.jar to the maven repository rather than pig-core-{version}.jar, as the udf builders need the full jar rather than just the core jar.

Make pig artifacts available through maven
Key: PIG-1334
URL: https://issues.apache.org/jira/browse/PIG-1334
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
Fix For: 0.8.0
Attachments: mvn-pig.patch, mvn_pig_2.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path
[ https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886274#action_12886274 ]

Hadoop QA commented on PIG-1486:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12448935/PIG-1486.patch
against trunk revision 960062.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/341/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/341/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/341/console

This message is automatically generated.

update ant eclipse-files target to include new jar and remove contrib dirs from build path
Key: PIG-1486
URL: https://issues.apache.org/jira/browse/PIG-1486
Project: Pig
Issue Type: Bug
Components: tools
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Priority: Minor
Fix For: 0.8.0
Attachments: PIG-1486.patch

.eclipse.templates/.classpath needs to be updated to address the following:
1. There is a new jar that is used by the code: guava-r03.jar.
2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in eclipse.
3. Remove the contrib projects from the classpath, as discussed in PIG-1390, until all libs necessary for the contribs are included in the classpath.
[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886281#action_12886281 ] Hadoop QA commented on PIG-1472: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448937/PIG-1472.2.patch against trunk revision 960062. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 69 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. -1 javac. The applied patch generated 148 javac compiler warnings (more than the trunk's current 145 warnings). -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. -1 release audit. The applied patch generated 400 release audit warnings (more than the trunk's current 399 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/console This message is automatically generated. 
Optimize serialization/deserialization between Map and Reduce and between MR jobs
Key: PIG-1472
URL: https://issues.apache.org/jira/browse/PIG-1472
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: PIG-1472.2.patch, PIG-1472.patch

In certain types of pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen).

There are a few optimizations that have been shown to improve the performance of sedes in my tests:
1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than 1/2.

Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries.
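
The two optimizations listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class name CompactSedes, the method names, and the one-byte tag that records the width of the length field are all hypothetical. Only DataOutput.writeUTF and DataInput.readUTF are taken from the issue text.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch; none of these names come from Pig or from the patch.
public class CompactSedes {
    // Assumed one-byte tags that record how wide the length field is.
    static final byte LEN_1_BYTE = 0;
    static final byte LEN_2_BYTES = 1;
    static final byte LEN_4_BYTES = 2;

    // Optimization 1: store the length in the smallest field that fits.
    public static void writeBytes(DataOutput out, byte[] value) throws IOException {
        int len = value.length;
        if (len <= 0xFF) {
            out.writeByte(LEN_1_BYTE);
            out.writeByte(len);       // low 8 bits, read back as unsigned
        } else if (len <= 0xFFFF) {
            out.writeByte(LEN_2_BYTES);
            out.writeShort(len);      // low 16 bits, read back as unsigned
        } else {
            out.writeByte(LEN_4_BYTES);
            out.writeInt(len);
        }
        out.write(value);
    }

    public static byte[] readBytes(DataInput in) throws IOException {
        byte tag = in.readByte();
        int len;
        switch (tag) {
            case LEN_1_BYTE:  len = in.readUnsignedByte();  break;
            case LEN_2_BYTES: len = in.readUnsignedShort(); break;
            case LEN_4_BYTES: len = in.readInt();           break;
            default: throw new IOException("Unknown length tag: " + tag);
        }
        byte[] value = new byte[len];
        in.readFully(value);
        return value;
    }

    // Optimization 2: delegate String sedes to the JDK's modified UTF-8 routines.
    // writeUTF throws UTFDataFormatException if the encoded form exceeds 65535 bytes.
    public static void writeString(DataOutput out, String s) throws IOException {
        out.writeUTF(s);
    }

    public static String readString(DataInput in) throws IOException {
        return in.readUTF();
    }
}
```

With this layout a 10-byte bytearray costs 12 bytes on the wire (tag, one length byte, payload) rather than 14 with a tag plus a four-byte integer length. A real format could fold the tag into an existing type byte; that choice is left open here.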
[jira] Commented: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886332#action_12886332 ]

Olga Natkovich commented on PIG-1484:

I think that's different. Globbing means: give me any data that matches the glob. I think the semantics of a list is that all elements must exist. What does PigStorage do?

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;
[jira] Assigned: (PIG-1461) support union operation that merges based on column names
[ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1461:

Assignee: Thejas M Nair

support union operation that merges based on column names
Key: PIG-1461
URL: https://issues.apache.org/jira/browse/PIG-1461
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

When the data has a schema, it often makes sense to union on the column names in the schema rather than on the position of the columns. The behavior of the existing union operator should remain backward compatible. This feature can be supported either with a new operator or by extending union to support a 'using' clause. I am thinking of having a new operator called either unionschema or merge. Does anybody have any other suggestions for the syntax?

example -
L1 = load 'x' as (a,b);
L2 = load 'y' as (b,c);
U = unionschema L1, L2;
describe U;
U: {a:bytearray, b:bytearray, c:bytearray}
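
The name-based merge proposed above can be sketched outside Pig as follows. This is a minimal illustration, not the operator's implementation: the class UnionBySchema and its methods are hypothetical, schemas are modelled as plain lists of column names, and filling missing columns with null is an assumption that the issue text does not state.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and null-filling behavior are assumptions.
public class UnionBySchema {

    // Merged schema: columns of the left input first, then any new
    // columns of the right input, each name appearing once.
    public static List<String> mergeSchemas(List<String> left, List<String> right) {
        Set<String> merged = new LinkedHashSet<>(left);
        merged.addAll(right);
        return new ArrayList<>(merged);
    }

    // Re-lays a row out in the target column order, looking values up
    // by column name; columns the row does not have become null.
    public static List<Object> realign(List<String> schema, List<Object> row,
                                       List<String> target) {
        List<Object> out = new ArrayList<>(target.size());
        for (String column : target) {
            int idx = schema.indexOf(column);
            out.add(idx >= 0 ? row.get(idx) : null);
        }
        return out;
    }

    // Union of two inputs on column names rather than positions.
    public static List<List<Object>> unionByName(
            List<String> leftSchema, List<List<Object>> leftRows,
            List<String> rightSchema, List<List<Object>> rightRows) {
        List<String> target = mergeSchemas(leftSchema, rightSchema);
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> row : leftRows) {
            result.add(realign(leftSchema, row, target));
        }
        for (List<Object> row : rightRows) {
            result.add(realign(rightSchema, row, target));
        }
        return result;
    }
}
```

For the example in the issue, merging the schemas (a, b) and (b, c) gives (a, b, c), which matches the describe output shown there.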
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886362#action_12886362 ] Richard Ding commented on PIG-1389: --- Locally ran and passed core unit tests. Implement Pig counter to track number of rows for each input files --- Key: PIG-1389 URL: https://issues.apache.org/jira/browse/PIG-1389 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch, PIG-1389_2.patch A MR job generated by Pig not only can have multiple outputs (in the case of multiquery) but also can have multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number of records in the given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886365#action_12886365 ]

Daniel Dai commented on PIG-1484:

That's a good point. We shall follow what PigStorage does. PigStorage requires all files to exist. Will change the patch.

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;
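
The list semantics agreed on in this thread (every element of a comma separated location must exist, unlike a glob, which returns whatever matches) can be sketched as below. This is an illustration only: the class CommaPathList and the method resolveAll are hypothetical, and the local filesystem via java.nio.file stands in for HDFS. It is not the code from the PIG-1484 patches.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the local filesystem is used in place of HDFS.
public class CommaPathList {

    // Splits a location such as "1.bin,2.bin" and fails fast if any
    // listed element is missing, instead of silently skipping it.
    public static List<Path> resolveAll(String location) throws FileNotFoundException {
        List<Path> paths = new ArrayList<>();
        for (String part : location.split(",")) {
            Path p = Paths.get(part.trim());
            if (!Files.exists(p)) {
                throw new FileNotFoundException("Input path does not exist: " + p);
            }
            paths.add(p);
        }
        return paths;
    }
}
```

A test along the lines requested later in the thread would then compare the rows read back with the contents of all listed files, not only check that the load succeeds.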
[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1484:

Attachment: PIG-1484-2.patch

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch, PIG-1484-2.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;

[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1484:

Status: Patch Available (was: Open)

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch, PIG-1484-2.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;

[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1484:

Status: Open (was: Patch Available)

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch, PIG-1484-2.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;

[jira] Commented: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886375#action_12886375 ]

Olga Natkovich commented on PIG-1484:

+1 to the code changes. I think it would be good if the test case actually verified that it got all the data it expects, not just that it can get to the data.

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch, PIG-1484-2.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886415#action_12886415 ] Robert Gibbon commented on PIG-366: --- I submitted some basic fixes to the pig-eclipse plugin today, so that it works on my environment (osx / java1.5 32bit) - maybe I should volunteer to take on PigPen, if no further progress has been made on this? Lemme know if I can help PigPen - Eclipse plugin for a graphical PigLatin editor --- Key: PIG-366 URL: https://issues.apache.org/jira/browse/PIG-366 Project: Pig Issue Type: New Feature Reporter: Shubham Chopra Assignee: Daniel Dai Priority: Minor Attachments: org.apache.pig.pigpen_0.0.1.jar, org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, pigpen.patch, pigPen.patch, PigPen.tgz This is an Eclipse plugin that provides a GUI that can help users create PigLatin scripts and see the example generator outputs on the fly and submit the jobs to hadoop clusters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-155) logo improvement
[ https://issues.apache.org/jira/browse/PIG-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Gibbon updated PIG-155: -- Attachment: pig.png As a pig user, I'd love to see you adopt a better logo (or at least a less crunchy graphic). I know that the Disney character has been around a while, but certainly not so long that it is too late to swap him out...call him a technical debt, perhaps? The suggested image is originally sourced from the public domain. logo improvement Key: PIG-155 URL: https://issues.apache.org/jira/browse/PIG-155 Project: Pig Issue Type: Improvement Reporter: Stefan Groschupf Assignee: Stefan Groschupf Priority: Trivial Fix For: 0.2.0 Attachments: 080224_logo_pig_01_rgb.jpg, pig.png, pig_logo_improvement.zip -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886427#action_12886427 ] Olga Natkovich commented on PIG-366: Hi Robert, We would love it if you decide to own PigPen - it is all yours! :) PigPen - Eclipse plugin for a graphical PigLatin editor --- Key: PIG-366 URL: https://issues.apache.org/jira/browse/PIG-366 Project: Pig Issue Type: New Feature Reporter: Shubham Chopra Assignee: Daniel Dai Priority: Minor Attachments: org.apache.pig.pigpen_0.0.1.jar, org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, pigpen.patch, pigPen.patch, PigPen.tgz This is an Eclipse plugin that provides a GUI that can help users create PigLatin scripts and see the example generator outputs on the fly and submit the jobs to hadoop clusters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-736) Inconsistent error message when the message should be about org.apache.hadoop.fs.permission.AccessControlException: Permission denied:
[ https://issues.apache.org/jira/browse/PIG-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-736. Fix Version/s: 0.7.0 Resolution: Fixed slicer code is completely gone with Pig 0.7.0. Please, reopen if the error message is still not clear Inconsistent error message when the message should be about org.apache.hadoop.fs.permission.AccessControlException: Permission denied: -- Key: PIG-736 URL: https://issues.apache.org/jira/browse/PIG-736 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.7.0 Attachments: pig_latestversion_errmsg.log, pig_oldversion_errmsg.log Suppose I have Pig script which accesses a directory in HDFS for which I do not have permissions shell hadoop fs -ls /mydata/group_permissions/ drwxr-x--- - groupuser restrictedgroup 0 2009-03-24 10:58 /mydata/group_permissions/20090323 {code} %default dates_to_process '20090323' MYDATA = load '/mydata/group_permissions/{$dates_to_process}*' using PigStorage() as (col1,col2,col3) ; MYDATA_PROJECT = foreach MYDATA generate (chararray) col1#'acct' as acct, (int)col1#'country' as country, (int)col1#'product' as product dump MYDATA_PROJECT; {code} The error message we get is: === 2009-03-26 00:00:05,753 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2099: Problem in constructing slices. Details at logfile: /home/viraj/pig_1238025596328.log === This message is definitely hard to debug === With the previous version 1.0.0 I get the following error message, which is more appropriate to this case. 
=== 2009-03-26 00:01:41,787 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - java.io.IOException: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=viraj, access=READ_EXECUTE, inode=20090323:groupuser:restrictedgroup:rwxr-x--- [org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=viraj, access=READ_EXECUTE, inode=20090323:groupuser:restrictedgroup:rwxr-x---] at org.apache.pig.backend.hadoop.datastorage.HDirectory.iterator(HDirectory.java:157) at org.apache.pig.backend.executionengine.PigSlicer.slice(PigSlicer.java:77) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:206) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.RuntimeException: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=viraj, access=READ_EXECUTE, inode=20090323:groupuser:restrictedgroup:rwxr-x--- ... 8 more 2009-03-26 00:01:41,798 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias MYDATA_PROJECT Details at logfile: /home/viraj/pig_1238025692361.log === -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1488) Make HDFS temp dir configurable
Make HDFS temp dir configurable --- Key: PIG-1488 URL: https://issues.apache.org/jira/browse/PIG-1488 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Fix For: 0.8.0 Currently it is hardcoded to /tmp. It should be made into a property. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
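
One way the change requested above could look is sketched here: read the temp directory from a property and fall back to the current hardcoded /tmp. The class TempDirConfig, the method tempDir, and in particular the property key are assumptions made for illustration; the issue does not name the property.

```java
import java.util.Properties;

// Hypothetical sketch; the property key below is an assumed name.
public class TempDirConfig {
    public static final String TEMP_DIR_KEY = "pig.temp.dir"; // assumption, not from the issue
    public static final String DEFAULT_TEMP_DIR = "/tmp";     // the currently hardcoded value

    // Returns the configured temp dir, or /tmp when the property is
    // absent or blank, so existing setups keep their behavior.
    public static String tempDir(Properties props) {
        String dir = props.getProperty(TEMP_DIR_KEY, DEFAULT_TEMP_DIR).trim();
        return dir.isEmpty() ? DEFAULT_TEMP_DIR : dir;
    }
}
```

Keeping /tmp as the default means that scripts and clusters that never set the property see no change.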
[jira] Resolved: (PIG-755) Difficult to debug parameter substitution problems based on the error messages when running in local mode
[ https://issues.apache.org/jira/browse/PIG-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-755. Resolution: Fixed With Pig 0.7.0, local mode and MR code use the same code path. Difficult to debug parameter substitution problems based on the error messages when running in local mode - Key: PIG-755 URL: https://issues.apache.org/jira/browse/PIG-755 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Attachments: inputfile.txt, localparamsub.pig I have a script in which I do a parameter substitution for the input file. I have a use case where I find it difficult to debug based on the error messages in local mode. {code} A = load '$infile' using PigStorage() as ( date: chararray, count : long, gmean : double ); dump A; {code} 1) I run it in local mode with the input file in the current working directory {code} prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -exectype local -param infile='inputfile.txt' localparamsub.pig {code} 2009-04-07 00:03:51,967 [main] ERROR org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore - Received error from storer function: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function. 2009-04-07 00:03:51,970 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Failed jobs!! 2009-04-07 00:03:51,971 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 1 out of 1 failed! 
2009-04-07 00:03:51,974 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062631414.log ERROR 1066: Unable to open iterator for alias A org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A at org.apache.pig.PigServer.openIterator(PigServer.java:439) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88) at org.apache.pig.Main.main(Main.java:352) Caused by: java.io.IOException: Job terminated with anomalous status FAILED at org.apache.pig.PigServer.openIterator(PigServer.java:433) ... 5 more 2) I run it in map reduce mode {code} prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -param infile='inputfile.txt' localparamsub.pig {code} 2009-04-07 00:07:31,660 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000 2009-04-07 00:07:32,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001 2009-04-07 00:07:34,543 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-07 00:07:39,540 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-07 00:07:39,540 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed 2009-04-07 00:07:39,563 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: inputfile does not exist. 
Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062851400.log ERROR 2100: inputfile does not exist. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A at org.apache.pig.PigServer.openIterator(PigServer.java:439) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88) at org.apache.pig.Main.main(Main.java:352) Caused by:
[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1484:

Status: Open (was: Patch Available)

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;

[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1484:

Status: Patch Available (was: Open)

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;

[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1484:

Attachment: PIG-1484-3.patch

Sure, reattaching the patch.

BinStorage should support comma separated path
Key: PIG-1484
URL: https://issues.apache.org/jira/browse/PIG-1484
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0
Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch

BinStorage does not take a comma separated path. The following script fails:

a = load '1.bin,2.bin' using BinStorage();
dump a;
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886467#action_12886467 ] Renato Javier Marroquín Mogrovejo commented on PIG-366: --- Hey Robert, I would be happy to help out too (= I have been looking for an interesting pig project, let me know how I can help, or how we can share the work load. Renato M. PigPen - Eclipse plugin for a graphical PigLatin editor --- Key: PIG-366 URL: https://issues.apache.org/jira/browse/PIG-366 Project: Pig Issue Type: New Feature Reporter: Shubham Chopra Assignee: Daniel Dai Priority: Minor Attachments: org.apache.pig.pigpen_0.0.1.jar, org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, pigpen.patch, pigPen.patch, PigPen.tgz This is an Eclipse plugin that provides a GUI that can help users create PigLatin scripts and see the example generator outputs on the fly and submit the jobs to hadoop clusters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
[ https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1442: Assignee: Thejas M Nair Fix Version/s: 0.8.0 java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766) --- Key: PIG-1442 URL: https://issues.apache.org/jira/browse/PIG-1442 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0, 0.7.0 Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev (18/may) Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0 Reporter: Dirk Schmid Assignee: Thejas M Nair Fix For: 0.8.0 As mentioned by Ashutosh this is a reopen of https://issues.apache.org/jira/browse/PIG-766 because there is still a problem which causes that PIG scales only by memory. For convenience here comes the last entry of the PIG-766-Jira-Ticket: {quote}1. Are you getting the exact same stack trace as mentioned in the jira?{quote} Yes the same and some similar traces: {noformat} java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2786) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94) at java.io.DataOutputStream.write(DataOutputStream.java:90) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209) at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264) at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123) at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90) at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77) at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179) at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880) at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501) java.lang.OutOfMemoryError: Java heap space at org.apache.pig.data.DefaultTuple.(DefaultTuple.java:58) at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35) at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263) at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63) at 
org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142) at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136) at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284) at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67) at
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Attachment: PIG-1483.patch This is the initial patch with a few caveats: # Each mapper processes only one job history file. This loader will create as many map tasks as the number of files to process. # It uses _org.apache.hadoop.mapred.DefaultJobHistoryParser_ to parse the job history files. This parser isn't production ready. [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. 
Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1481) PigServer throws exception if it cannot find hadoop-site.xml or core-site.xml
[ https://issues.apache.org/jira/browse/PIG-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1481. - Resolution: Won't Fix PigServer throws exception if it cannot find hadoop-site.xml or core-site.xml - Key: PIG-1481 URL: https://issues.apache.org/jira/browse/PIG-1481 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Sameer M Hi We've been using the Hadoop MiniCluster to do unit testing of our pig scripts in the following way. MiniCluster minicluster = MiniCluster.buildCluster(2,2); pigServer = new PigServer(ExecType.MAPREDUCE, minicluster.getProperties()); This has been working fine for 0.6 and 0.7. However, in the trunk (0.8) it looks like there is a change due to which an exception is thrown if neither hadoop-site.xml nor core-site.xml is found in the classpath. org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use local mode, please put -x local option in command line at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114) at org.apache.pig.impl.PigContext.connect(PigContext.java:177) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.PigServer.init(PigServer.java:204) at org.apache.pig.PigServer.init(PigServer.java:200) The problem seems to be in org.apache.pig.backend.hadoop.executionengine.HExecutionEngine: 148 if( hadoop_site == null && core_site == null ) { throw new ExecException("Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath)." + "If you plan to use local mode, please put -x local option in command line", 4010); } We would like to use the mapreduce mode but with the minicluster, and we have a lot of unit tests with that setup. 
Can this check be removed from this level ? Thanks Sameer -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
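The check described in the PIG-1481 report can be illustrated with a minimal sketch. It assumes the two files are looked up as classpath resources through `ClassLoader.getResource`; the report only shows the null test, so the lookup mechanism, the class name `HadoopConfigCheck` and the method `hasHadoopConfig` are hypothetical and are not Pig's API.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class HadoopConfigCheck {
    // Resource names taken from the error message quoted in the report.
    static final String HADOOP_SITE = "hadoop-site.xml";
    static final String CORE_SITE = "core-site.xml";

    /**
     * Hypothetical helper: returns true if at least one of the two
     * configuration files is visible to the given class loader.
     * The reported exception corresponds to this returning false.
     */
    public static boolean hasHadoopConfig(ClassLoader loader) {
        URL hadoopSite = loader.getResource(HADOOP_SITE);
        URL coreSite = loader.getResource(CORE_SITE);
        return hadoopSite != null || coreSite != null;
    }

    public static void main(String[] args) throws Exception {
        // A loader with no URLs and no parent sees neither file, which is
        // the situation of a test that passes all settings as Properties.
        try (URLClassLoader empty = new URLClassLoader(new URL[0], null)) {
            System.out.println(hasHadoopConfig(empty)); // prints "false"
        }
    }
}
```

A test setup that supplies its configuration only through `Properties`, as the MiniCluster case does, never puts either file on the classpath, so a check of this shape rejects it regardless of whether the supplied properties are sufficient.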
[jira] Commented: (PIG-1484) BinStorage should support comma seperated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886478#action_12886478 ] Olga Natkovich commented on PIG-1484: - +1 BinStorage should support comma seperated path -- Key: PIG-1484 URL: https://issues.apache.org/jira/browse/PIG-1484 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch BinStorage does not accept a comma-separated path. The following script fails: a = load '1.bin,2.bin' using BinStorage(); dump a; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1484) BinStorage should support comma seperated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1484: Fix Version/s: 0.7.0 BinStorage should support comma seperated path -- Key: PIG-1484 URL: https://issues.apache.org/jira/browse/PIG-1484 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0, 0.8.0 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch BinStorage does not accept a comma-separated path. The following script fails: a = load '1.bin,2.bin' using BinStorage(); dump a; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886502#action_12886502 ] Richard Ding commented on PIG-1483: --- Usage: {code} register piggybank.jar; A = load 'directory or file' using org.apache.pig.piggybank.storage.HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); {code} where j is a map with the following entries: {code} JOBID, JOBNAME, CLUSTER, QUEUE_NAME, STATUS, PIG_VERSION, HADOOP_VERSION, USER, USER_GROUP, HOST_DIR, JOBCONF, PIG_SCRIPT_ID, PIG_SCRIPT, TOTAL_LAUNCHED_MAPS, TOTAL_MAPS, FINISHED_MAPS, FAILED_MAPS, RACK_LOCAL_MAPS, DATA_LOCAL_MAPS, TOTAL_LAUNCHED_REDUCES, TOTAL_REDUCES, FINISHED_REDUCES, FAILED_REDUCES, SUBMIT_TIME, LAUNCH_TIME, FINISH_TIME, MAP_INPUT_RECORDS, MAP_OUTPUT_RECORDS, MAP_OUTPUT_BYTES, COMBINE_INPUT_RECORDS, COMBINE_OUTPUT_RECORDS, SPILLED_RECORDS, REDUCE_SHUFFLE_BYTES, REDUCE_INPUT_GROUPS, REDUCE_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS, HDFS_BYTES_READ, HDFS_BYTES_WRITTEN, FILE_BYTES_READ, FILE_BYTES_WRITTEN {code} m is a map with the following entries: {code} MAX_MAP_INPUT_ROWS, MIN_MAP_INPUT_ROWS, MAX_MAP_TIME, MIN_MAP_TIME, AVG_MAP_TIME, NUMBER_MAPS {code} r is a map with the following entries: {code} AVG_REDUCE_TIME, MAX_REDUCE_TIME, NUMBER_REDUCES, MIN_REDUCE_TIME, MIN_REDUCE_INPUT_ROWS, MAX_REDUCE_INPUT_ROWS {code} [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. 
Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1343: Fix Version/s: 0.8.0 pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Ashitosh Darbarwar Fix For: 0.8.0 Attachments: PIG-1343-1.patch There is a particular case where I was running with the latest trunk of Pig. {code} $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log $ls -l pig_1263420012601.log ls: pig_1263420012601.log: No such file or directory {code} The job failed and the log file did not contain anything; the only way to debug was to look into the JobTracker logs. Here are some reasons that could have caused this behavior: 1) The underlying filer/NFS had some issues. In that case do we not error on stdout? 2) There are some errors from the backend which are not being captured. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1489) Pig MapReduceLauncher does not use jars in register statement
Pig MapReduceLauncher does not use jars in register statement --- Key: PIG-1489 URL: https://issues.apache.org/jira/browse/PIG-1489 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Fix For: 0.8.0 If my Pig StoreFunc has its own OutputFormat class, then the Pig MapReduceLauncher will try to instantiate it before launching the MapReduce job and fail with a ClassNotFoundException. This happens because the Pig MapReduceLauncher uses its own classloader and ignores the classes in the jars named in the register statement. The effect is that the jars have to be not only in the register statement in the script but also on the Pig classpath via the -classpath option. This can be remedied by having the Pig MapReduceLauncher construct a classloader that includes the registered jars and use it to instantiate the OutputFormat class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
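The remedy proposed in PIG-1489 can be sketched with the standard `java.net.URLClassLoader`. This is only an illustration under stated assumptions, not the patch: the class `RegisteredJarLoader`, its two methods and the jar name `udfs.jar` are hypothetical, and the sketch assumes the registered jars are available as local file paths.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class RegisteredJarLoader {
    /**
     * Hypothetical helper: builds a class loader that sees the registered
     * jars in addition to everything the parent loader already sees.
     */
    public static ClassLoader forRegisteredJars(List<String> jarPaths, ClassLoader parent) {
        URL[] urls = new URL[jarPaths.size()];
        for (int i = 0; i < urls.length; i++) {
            try {
                urls[i] = Paths.get(jarPaths.get(i)).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad jar path: " + jarPaths.get(i), e);
            }
        }
        return new URLClassLoader(urls, parent);
    }

    /**
     * Hypothetical helper: instantiates a class by name through the given
     * loader. The class needs an accessible no-argument constructor.
     */
    public static Object instantiate(String className, ClassLoader loader) {
        try {
            return Class.forName(className, true, loader)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // "udfs.jar" is a placeholder for a jar named in a register statement.
        ClassLoader loader = forRegisteredJars(
                Arrays.asList("udfs.jar"), RegisteredJarLoader.class.getClassLoader());
        // java.util.ArrayList stands in for an OutputFormat class shipped in that jar.
        Object instance = instantiate("java.util.ArrayList", loader);
        System.out.println(instance.getClass().getName()); // prints "java.util.ArrayList"
    }
}
```

Passing the existing loader as the parent keeps classes already on the classpath resolvable, so only classes that exist solely in the registered jars are loaded from the new URLs.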
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886530#action_12886530 ] Aniket Mokashi commented on PIG-928: I have uploaded a wiki page to mention the usage and syntax-- http://wiki.apache.org/pig/UDFsUsingScriptingLanguages. UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Attachment: (was: PIG-1483.patch) [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) 
j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886537#action_12886537 ] Richard Ding commented on PIG-1483: --- Add these additional entries to the first map: {code} PIG_JOB_FEATURE, PIG_JOB_ALIAS, PIG_JOB_PARENTS {code} [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as 
(j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Attachment: PIG-1483.patch [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch, PIG-1483.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, 
(Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1484) BinStorage should support comma seperated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886538#action_12886538 ] Hadoop QA commented on PIG-1484: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448988/PIG-1484-2.patch against trunk revision 960062. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/363/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/363/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/363/console This message is automatically generated. BinStorage should support comma seperated path -- Key: PIG-1484 URL: https://issues.apache.org/jira/browse/PIG-1484 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0, 0.8.0 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch BinStorage does not take comma seperated path. The following script fail: a = load '1.bin,2.bin' using BinStorage(); dump a; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1472: --- Status: Open (was: Patch Available) Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch In certain types of pig queries most of the execution time is spent in serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in PigMix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests - 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half. Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
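The two optimizations listed in PIG-1472 can be sketched as follows. This is a minimal illustration, not the format used by the patch: the class `CompactSedes`, its methods and the marker bytes `TINY_BYTEARRAY` and `BYTEARRAY` are hypothetical. Only `DataOutput.writeUTF` and `DataInput.readUTF` are named in the issue; both are standard `java.io` calls.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CompactSedes {
    // Hypothetical type markers; the real patch may use different values.
    static final byte TINY_BYTEARRAY = 1; // length stored in one unsigned byte
    static final byte BYTEARRAY = 2;      // length stored in a 4-byte int

    /** Optimization 1: use a one-byte length when the array is shorter than 255 bytes. */
    public static byte[] writeByteArray(byte[] data) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            if (data.length < 255) {
                out.writeByte(TINY_BYTEARRAY);
                out.writeByte(data.length);
            } else {
                out.writeByte(BYTEARRAY);
                out.writeInt(data.length);
            }
            out.write(data);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back what writeByteArray produced, choosing the length width by marker. */
    public static byte[] readByteArray(byte[] encoded) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
            byte marker = in.readByte();
            int length;
            if (marker == TINY_BYTEARRAY) {
                length = in.readUnsignedByte();
            } else if (marker == BYTEARRAY) {
                length = in.readInt();
            } else {
                throw new IOException("Unknown marker: " + marker);
            }
            byte[] data = new byte[length];
            in.readFully(data);
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Optimization 2: delegate String sedes to writeUTF (2-byte length, modified UTF-8). */
    public static byte[] writeString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            // writeUTF rejects strings whose encoded form exceeds 65535 bytes,
            // so longer strings would need a separate code path.
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back what writeString produced. */
    public static String readString(byte[] encoded) {
        try {
            return new DataInputStream(new ByteArrayInputStream(encoded)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeByteArray(new byte[10]).length);  // prints 12 (1 marker + 1 length + 10)
        System.out.println(writeByteArray(new byte[300]).length); // prints 305 (1 marker + 4 length + 300)
        System.out.println(writeString("pig").length);            // prints 5 (2 length + 3 bytes)
    }
}
```

For short values the saving is three bytes per field, which adds up when every tuple crossing a map or reduce boundary carries many small fields.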
[jira] Updated: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-928: --- Status: Open (was: Patch Available) UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1472: --- Attachment: PIG-1472.3.patch Patch with fixes for javac, javadoc and findbugs warnings. The tests that were reported as failed passed when I ran them on my machine; the failures seem to have been caused by problems in the Hudson environment. Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch In certain types of pig queries most of the execution time is spent in serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in PigMix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests - 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half. Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1472: --- Status: Patch Available (was: Open) Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch In certain types of pig queries most of the execution time is spent in serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in PigMix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests - 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half. Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1484) BinStorage should support comma seperated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886591#action_12886591 ] Hadoop QA commented on PIG-1484: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12449001/PIG-1484-3.patch against trunk revision 960062. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/342/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/342/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/342/console This message is automatically generated. BinStorage should support comma seperated path -- Key: PIG-1484 URL: https://issues.apache.org/jira/browse/PIG-1484 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.7.0, 0.8.0 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch BinStorage does not take comma seperated path. The following script fail: a = load '1.bin,2.bin' using BinStorage(); dump a; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886610#action_12886610 ]

Hadoop QA commented on PIG-928:
---
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12449018/RegisterPythonUDF_Final.patch
against trunk revision 960062.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 146 javac compiler warnings (more than the trunk's current 145 warnings).
-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/console

This message is automatically generated.

UDFs in scripting languages
---
Key: PIG-928
URL: https://issues.apache.org/jira/browse/PIG-928
Project: Pig
Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
Fix For: 0.8.0
Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip

It should be possible to write UDFs in scripting languages such as Python, Ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.