[jira] Updated: (PIG-1334) Make pig artifacts available through maven

2010-07-08 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1334:
---

Attachment: mvn_pig_2.patch

Based on the feedback, I am renaming the Pig jars back to their old names; I had 
changed the names to make them compatible with the Maven naming standard. I am 
also publishing pig.jar to the Maven repository, rather than 
pig-core-{version}.jar, since UDF builders need the full jar rather than just 
the core jar.

 Make pig artifacts available through maven
 --

 Key: PIG-1334
 URL: https://issues.apache.org/jira/browse/PIG-1334
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: mvn-pig.patch, mvn_pig_2.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path

2010-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886274#action_12886274
 ] 

Hadoop QA commented on PIG-1486:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448935/PIG-1486.patch
  against trunk revision 960062.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/341/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/341/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/341/console

This message is automatically generated.

 update ant eclipse-files target to include new jar and remove contrib dirs 
 from build path
 --

 Key: PIG-1486
 URL: https://issues.apache.org/jira/browse/PIG-1486
 Project: Pig
  Issue Type: Bug
  Components: tools
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1486.patch


  .eclipse.templates/.classpath needs to be updated to address the following:
 1. There is a new jar used by the code: guava-r03.jar.
 2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in Eclipse.
 3. Remove the contrib projects from the class path, as discussed in PIG-1390, 
 until all libraries needed by the contribs are included in the classpath.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886281#action_12886281
 ] 

Hadoop QA commented on PIG-1472:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448937/PIG-1472.2.patch
  against trunk revision 960062.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 69 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

-1 javac.  The applied patch generated 148 javac compiler warnings (more 
than the trunk's current 145 warnings).

-1 findbugs.  The patch appears to introduce 2 new Findbugs warnings.

-1 release audit.  The applied patch generated 400 release audit warnings 
(more than the trunk's current 399 warnings).

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/console

This message is automatically generated.

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.patch


 In certain types of pig queries, most of the execution time is spent 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (a runtime increase of a few times has 
 been observed).
 There are a few optimizations that have been shown to improve the performance 
 of sedes in my tests:
 1. Use a smaller number of bytes to store the length of a column. For example, 
 if a bytearray is smaller than 255 bytes, a single byte can be used to store 
 the length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF. This reduces the cost of serialization by more than half. 
 Zebra and BinStorage are known to use the DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.
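 A minimal, hypothetical Java sketch of the two ideas above (this is not Pig's 
 actual DataReaderWriter code; the class and method names are assumptions made 
 only for illustration):
 {code}
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;

 public final class SedesSketch {
     // Idea 1: lengths below 255 fit in one unsigned byte; 0xFF escapes to a full int.
     static void writeByteArray(DataOutput out, byte[] data) throws IOException {
         if (data.length < 255) {
             out.writeByte(data.length);
         } else {
             out.writeByte(0xFF);
             out.writeInt(data.length);
         }
         out.write(data);
     }

     static byte[] readByteArray(DataInput in) throws IOException {
         int len = in.readUnsignedByte();
         if (len == 0xFF) {
             len = in.readInt();
         }
         byte[] data = new byte[len];
         in.readFully(data);
         return data;
     }

     // Idea 2: let the JDK's modified-UTF-8 codec handle strings
     // (note: writeUTF is limited to 65535 encoded bytes per string).
     static void writeString(DataOutput out, String s) throws IOException {
         out.writeUTF(s);
     }

     static String readString(DataInput in) throws IOException {
         return in.readUTF();
     }
 }
 {code}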

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886332#action_12886332
 ] 

Olga Natkovich commented on PIG-1484:
-

I think that's different. Globbing means: give me any data that matches the 
glob. I think the semantics of a list is that all elements must exist. What does 
PigStorage do?

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1461) support union operation that merges based on column names

2010-07-08 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1461:
---

Assignee: Thejas M Nair

 support union operation that merges based on column names
 -

 Key: PIG-1461
 URL: https://issues.apache.org/jira/browse/PIG-1461
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 When the data has a schema, it often makes sense to union on the column names 
 in the schema rather than on the position of the columns. 
 The behavior of the existing union operator should remain backward compatible.
 This feature can be supported either with a new operator or by extending union 
 to support a 'using' clause. I am thinking of having a new operator called 
 either unionschema or merge. Does anybody have any other suggestions for the 
 syntax?
 example -
 L1 = load 'x' as (a,b);
 L2 = load 'y' as (b,c);
 U = unionschema L1, L2;
 describe U;
 U: {a:bytearray, b:bytearray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-07-08 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886362#action_12886362
 ] 

Richard Ding commented on PIG-1389:
---

Locally ran and passed core unit tests.

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch, 
 PIG-1389_2.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or 
 cogroup). In both cases, the existing Hadoop counters (e.g. 
 MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number 
 of records in a given input or output. PIG-1299 addressed the case of 
 multiple outputs. We need to add new counters for jobs with multiple inputs.
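 The sketch below only illustrates the general approach (it is not the PIG-1299 
 or PIG-1389 implementation): a mapper keeping one counter per input file, using 
 a hypothetical counter group name and the input file name as the counter name, 
 and assuming file-based input splits.
 {code}
 import java.io.IOException;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;

 public class PerInputCountingMapper
         extends Mapper<LongWritable, Text, LongWritable, Text> {

     private String inputName;

     @Override
     protected void setup(Context context) {
         // Derive a stable counter name from this split's input file.
         inputName = ((FileSplit) context.getInputSplit()).getPath().getName();
     }

     @Override
     protected void map(LongWritable key, Text value, Context context)
             throws IOException, InterruptedException {
         // One increment per record read from this particular input.
         context.getCounter("MultiInputCounters", inputName).increment(1);
         context.write(key, value);
     }
 }
 {code}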

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886365#action_12886365
 ] 

Daniel Dai commented on PIG-1484:
-

That's a good point. We shall follow what PigStorage does: PigStorage requires 
that all files exist. I will change the patch.
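For illustration only, here is a minimal Java sketch of that semantics (this is 
not the PIG-1484 patch; the class and method names are hypothetical): split the 
comma-separated location and require every path to exist, as PigStorage is 
described as doing.

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class CommaPathCheck {
    static void validateLocations(String location, Configuration conf) throws IOException {
        // Each comma-separated element must resolve to an existing path.
        for (String part : location.split(",")) {
            Path p = new Path(part.trim());
            FileSystem fs = p.getFileSystem(conf);
            if (!fs.exists(p)) {
                throw new IOException("Input path does not exist: " + p);
            }
        }
    }
}
{code}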

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1484:


Attachment: PIG-1484-2.patch

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1484:


Status: Patch Available  (was: Open)

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1484:


Status: Open  (was: Patch Available)

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886375#action_12886375
 ] 

Olga Natkovich commented on PIG-1484:
-

+1 to the code changes. I think it would be good if the test case actually 
verified that it got all the data it expects, not just that it can get to the 
data.

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-07-08 Thread Robert Gibbon (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886415#action_12886415
 ] 

Robert Gibbon commented on PIG-366:
---

I submitted some basic fixes to the pig-eclipse plugin today so that it works 
in my environment (OS X / Java 1.5, 32-bit). Maybe I should volunteer to take on 
PigPen, if no further progress has been made on this?

Let me know if I can help.

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Daniel Dai
Priority: Minor
 Attachments: org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-155) logo improvement

2010-07-08 Thread Robert Gibbon (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gibbon updated PIG-155:
--

Attachment: pig.png

As a pig user, I'd love to see you adopt a better logo (or at least a less 
crunchy graphic).

I know that the Disney character has been around a while, but certainly not so 
long that it is too late to swap him out...call him a technical debt, perhaps?

The suggested image is originally sourced from the public domain.

 logo improvement
 

 Key: PIG-155
 URL: https://issues.apache.org/jira/browse/PIG-155
 Project: Pig
  Issue Type: Improvement
Reporter: Stefan Groschupf
Assignee: Stefan Groschupf
Priority: Trivial
 Fix For: 0.2.0

 Attachments: 080224_logo_pig_01_rgb.jpg, pig.png, 
 pig_logo_improvement.zip




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-07-08 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886427#action_12886427
 ] 

Olga Natkovich commented on PIG-366:


Hi Robert,

We would love it if you decide to own PigPen - it is all yours! :)

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Daniel Dai
Priority: Minor
 Attachments: org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-736) Inconsistent error message when the message should be about org.apache.hadoop.fs.permission.AccessControlException: Permission denied:

2010-07-08 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-736.


Fix Version/s: 0.7.0
   Resolution: Fixed

The slicer code is completely gone as of Pig 0.7.0. Please reopen if the error 
message is still not clear.

 Inconsistent error message when the message should be about 
 org.apache.hadoop.fs.permission.AccessControlException: Permission denied:
 --

 Key: PIG-736
 URL: https://issues.apache.org/jira/browse/PIG-736
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.7.0

 Attachments: pig_latestversion_errmsg.log, pig_oldversion_errmsg.log


 Suppose I have a Pig script which accesses a directory in HDFS for which I do 
 not have permissions:
 shell$ hadoop fs -ls /mydata/group_permissions/
 drwxr-x---   - groupuser restrictedgroup  0 2009-03-24 10:58 
 /mydata/group_permissions/20090323
 {code}
 %default dates_to_process '20090323'
 MYDATA = load '/mydata/group_permissions/{$dates_to_process}*' using
 PigStorage() as (col1,col2,col3) ;
 MYDATA_PROJECT = foreach MYDATA generate
 (chararray) col1#'acct' as acct,
 (int)col1#'country' as country,
 (int)col1#'product' as product;
 dump MYDATA_PROJECT;
 {code}
 The error message we get is:
 ===
 2009-03-26 00:00:05,753 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2099: Problem in constructing slices.
 Details at logfile: /home/viraj/pig_1238025596328.log
 ===
 This message is definitely hard to debug
 ===
 With the previous version 1.0.0 I get the following error message, which is 
 more appropriate to this case.
 ===
 2009-03-26 00:01:41,787 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - java.io.IOException: 
 org.apache.hadoop.fs.permission.AccessControlException: Permission denied: 
 user=viraj, access=READ_EXECUTE, 
 inode=20090323:groupuser:restrictedgroup:rwxr-x--- 
 [org.apache.hadoop.fs.permission.AccessControlException: Permission denied: 
 user=viraj, access=READ_EXECUTE, 
 inode=20090323:groupuser:restrictedgroup:rwxr-x---]
 at 
 org.apache.pig.backend.hadoop.datastorage.HDirectory.iterator(HDirectory.java:157)
 at 
 org.apache.pig.backend.executionengine.PigSlicer.slice(PigSlicer.java:77)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:206)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742)
 at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.RuntimeException: 
 org.apache.hadoop.fs.permission.AccessControlException: Permission denied: 
 user=viraj, access=READ_EXECUTE, 
 inode=20090323:groupuser:restrictedgroup:rwxr-x---
 ... 8 more
 2009-03-26 00:01:41,798 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1066: Unable to open iterator for alias MYDATA_PROJECT
 Details at logfile:  /home/viraj/pig_1238025692361.log
 ===

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1488) Make HDFS temp dir configurable

2010-07-08 Thread Olga Natkovich (JIRA)
Make HDFS temp dir configurable
---

 Key: PIG-1488
 URL: https://issues.apache.org/jira/browse/PIG-1488
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
 Fix For: 0.8.0


Currently it is hardcoded to /tmp. It should be made into a property.
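As a rough illustration of the intended change (the names below are assumptions, 
not existing Pig code or configuration), the temp-dir root would be read from a 
property, with the current /tmp behavior as the default:

{code}
import java.util.Properties;

public final class TempDirSketch {
    static String tempDirRoot(Properties props) {
        // "pig.temp.dir" is a hypothetical property name used for illustration.
        return props.getProperty("pig.temp.dir", "/tmp");
    }
}
{code}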

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-755) Difficult to debug parameter substitution problems based on the error messages when running in local mode

2010-07-08 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-755.


Resolution: Fixed

With Pig 0.7.0, local mode and MR code use the same code path. 

 Difficult to debug parameter substitution problems based on the error 
 messages when running in local mode
 -

 Key: PIG-755
 URL: https://issues.apache.org/jira/browse/PIG-755
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Attachments: inputfile.txt, localparamsub.pig


 I have a script in which I do a parameter substitution for the input file. I 
 have a use case where I find it difficult to debug based on the error 
 messages in local mode.
 {code}
 A = load '$infile' using PigStorage() as
  (
date: chararray,
count   : long,
gmean   : double
 );
 dump A;
 {code}
 1) I run it in local mode with the input file in the current working directory
 {code}
 prompt  $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main 
 -exectype local -param infile='inputfile.txt' localparamsub.pig
 {code}
 2009-04-07 00:03:51,967 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore
  - Received error from storer function: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
 setup the load function.
 2009-04-07 00:03:51,970 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Failed jobs!!
 2009-04-07 00:03:51,971 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 1 out of 1 
 failed!
 2009-04-07 00:03:51,974 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1066: Unable to open iterator for alias A
 
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062631414.log
 
 ERROR 1066: Unable to open iterator for alias A
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias A
 at org.apache.pig.PigServer.openIterator(PigServer.java:439)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:352)
 Caused by: java.io.IOException: Job terminated with anomalous status FAILED
 at org.apache.pig.PigServer.openIterator(PigServer.java:433)
 ... 5 more
 
 2) I run it in map reduce mode
 {code}
 prompt  $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -param 
 infile='inputfile.txt' localparamsub.pig
 {code}
 2009-04-07 00:07:31,660 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:9000
 2009-04-07 00:07:32,074 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:9001
 2009-04-07 00:07:34,543 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the arguments. Applications should 
 implement Tool for the same.
 2009-04-07 00:07:39,540 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2009-04-07 00:07:39,540 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Map reduce job failed
 2009-04-07 00:07:39,563 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2100: inputfile does not exist.
 
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062851400.log
 
 ERROR 2100: inputfile does not exist.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias A
 at org.apache.pig.PigServer.openIterator(PigServer.java:439)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:352)
 Caused by: 

[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1484:


Status: Open  (was: Patch Available)

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1484:


Status: Patch Available  (was: Open)

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1484:


Attachment: PIG-1484-3.patch

Sure, reattach the patch.

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-07-08 Thread Renato Javier Marroquín Mogrovejo (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886467#action_12886467
 ] 

Renato Javier Marroquín Mogrovejo commented on PIG-366:
---

Hey Robert, I would be happy to help out too (=
I have been looking for an interesting Pig project. Let me know how I can help, 
or how we can share the workload.

Renato M.

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Daniel Dai
Priority: Minor
 Attachments: org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)

2010-07-08 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1442:


 Assignee: Thejas M Nair
Fix Version/s: 0.8.0

 java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
 ---

 Key: PIG-1442
 URL: https://issues.apache.org/jira/browse/PIG-1442
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.7.0
 Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev 
 (18/may)
 Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0
Reporter: Dirk Schmid
Assignee: Thejas M Nair
 Fix For: 0.8.0


 As mentioned by Ashutosh, this is a reopen of 
 https://issues.apache.org/jira/browse/PIG-766 because there is still a 
 problem which causes Pig to scale only with memory.
 For convenience, here is the last entry of the PIG-766 JIRA ticket:
 {quote}1. Are you getting the exact same stack trace as mentioned in the 
 jira?{quote} Yes, the same and some similar traces:
 {noformat}
 java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2786)
   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
   at 
 org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
   at 
 org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
   at 
 org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
   at 
 org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
   at 
 org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
   at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
   at 
 org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
   at 
 org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
   at 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
   at 
 org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
   at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
   at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
 java.lang.OutOfMemoryError: Java heap space
   at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
   at 
 org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
   at 
 org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
   at 
 org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
   at 
 org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
   at 
 org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
   at 
 org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
   at 
 org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
   at 
 org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
   at 
 org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
   at 
 org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
   at 
 org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
   at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
   at 
 org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
   at 
 

[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-07-08 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Attachment: PIG-1483.patch

This is the initial patch with a few caveats:

# Each mapper processes only one job history file. This loader will create as 
many map tasks as the number of files to process.
# It uses _org.apache.hadoop.mapred.DefaultJobHistoryParser_ to parse the job 
history files. This parser isn't production ready.



 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1481) PigServer throws exception if it cannot find hadoop-site.xml or core-site.xml

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1481.
-

Resolution: Won't Fix

 PigServer throws exception if it cannot find hadoop-site.xml or core-site.xml
 -

 Key: PIG-1481
 URL: https://issues.apache.org/jira/browse/PIG-1481
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Sameer M

 Hi,
 We've been using the Hadoop MiniCluster to do unit testing of our Pig scripts 
 in the following way:
 MiniCluster minicluster = MiniCluster.buildCluster(2,2);
 pigServer = new PigServer(ExecType.MAPREDUCE, minicluster.getProperties());
 This has been working fine for 0.6 and 0.7. 
 However, in trunk (0.8) it looks like there is a change due to which an 
 exception is thrown if hadoop-site.xml or core-site.xml is not found in the 
 classpath:
 org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find 
 hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml 
 was found in the classpath). If you plan to use local mode, please put -x 
 local option in command line
   at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149)
   at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114)
   at org.apache.pig.impl.PigContext.connect(PigContext.java:177)
   at org.apache.pig.PigServer.init(PigServer.java:215)
   at org.apache.pig.PigServer.init(PigServer.java:204)
   at org.apache.pig.PigServer.init(PigServer.java:200)
 The problem seems to be 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine:148
 if( hadoop_site == null && core_site == null ) {
   throw new ExecException("Cannot find hadoop configurations in 
 classpath (neither hadoop-site.xml nor core-site.xml was found in the 
 classpath)." +
   " If you plan to use local mode, please put -x 
 local option in command line", 
   4010);
 }
 We would like to use the mapreduce mode but with the minicluster, and we have 
 a lot of unit tests with that setup.
 Can this check be removed from this level?
 Thanks,
 Sameer

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886478#action_12886478
 ] 

Olga Natkovich commented on PIG-1484:
-

+1

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-07-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1484:


Fix Version/s: 0.7.0

 BinStorage should support comma seperated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-07-08 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886502#action_12886502
 ] 

Richard Ding commented on PIG-1483:
---


Usage:

{code}
register piggybank.jar;
A = load 'directory or file' using 
org.apache.pig.piggybank.storage.HadoopJobHistoryLoader() as (j:map[], m:map[], 
r:map[]);
{code}

where j is a map with the following entries:

{code}

JOBID, JOBNAME, CLUSTER, QUEUE_NAME, STATUS, PIG_VERSION, HADOOP_VERSION, USER, 
USER_GROUP, HOST_DIR,
JOBCONF, PIG_SCRIPT_ID, PIG_SCRIPT,

TOTAL_LAUNCHED_MAPS, TOTAL_MAPS, FINISHED_MAPS, FAILED_MAPS, RACK_LOCAL_MAPS, 
DATA_LOCAL_MAPS,

TOTAL_LAUNCHED_REDUCES, TOTAL_REDUCES, FINISHED_REDUCES, FAILED_REDUCES,

SUBMIT_TIME, LAUNCH_TIME, FINISH_TIME,

MAP_INPUT_RECORDS, MAP_OUTPUT_RECORDS, MAP_OUTPUT_BYTES,
COMBINE_INPUT_RECORDS, COMBINE_OUTPUT_RECORDS, SPILLED_RECORDS,
REDUCE_SHUFFLE_BYTES, REDUCE_INPUT_GROUPS, REDUCE_INPUT_RECORDS, 
REDUCE_OUTPUT_RECORDS,

HDFS_BYTES_READ, HDFS_BYTES_WRITTEN, FILE_BYTES_READ, FILE_BYTES_WRITTEN,
{code}

m is a map with the following entries:

{code}
MAX_MAP_INPUT_ROWS, MIN_MAP_INPUT_ROWS, MAX_MAP_TIME, MIN_MAP_TIME, 
AVG_MAP_TIME, NUMBER_MAPS
{code}

r is a map with the following entries:

{code}
AVG_REDUCE_TIME, MAX_REDUCE_TIME, NUMBER_REDUCES, MIN_REDUCE_TIME, 
MIN_REDUCE_INPUT_ROWS, MAX_REDUCE_INPUT_ROWS
{code}

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-07-08 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1343:


Fix Version/s: 0.8.0

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Ashitosh Darbarwar
 Fix For: 0.8.0

 Attachments: PIG-1343-1.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case, do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured.
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1489) Pig MapReduceLauncher does not use jars in register statement

2010-07-08 Thread Olga Natkovich (JIRA)
 Pig MapReduceLauncher does not use jars in register statement 
---

 Key: PIG-1489
 URL: https://issues.apache.org/jira/browse/PIG-1489
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
 Fix For: 0.8.0


If my Pig StoreFunc has its own OutputFormat class, then the Pig 
MapReduceLauncher will try to instantiate it before launching the MapReduce job 
and will fail with ClassNotFoundException.

This happens because the Pig MapReduce launcher uses its own classloader and 
ignores the classes in the jars named in the register statement.

The effect is that the jars not only have to be in a register statement in the 
script but also on the Pig classpath via the -classpath option.

This can be remedied by having the Pig MapReduceLauncher construct a 
classloader that includes the registered jars and use it to instantiate the 
OutputFormat class.
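A hedged Java illustration of that remedy (not Pig's actual code; the class and 
method names are assumptions): build a URLClassLoader over the registered jars 
and resolve the OutputFormat class through it, delegating to the current 
thread's context classloader for everything else.

{code}
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

public final class RegisteredJarLoader {
    static Class<?> loadWithRegisteredJars(List<String> registeredJars,
                                           String outputFormatClassName) throws Exception {
        URL[] urls = new URL[registeredJars.size()];
        for (int i = 0; i < registeredJars.size(); i++) {
            urls[i] = new File(registeredJars.get(i)).toURI().toURL();
        }
        // Fall back to the context classloader for classes outside the registered jars.
        ClassLoader parent = Thread.currentThread().getContextClassLoader();
        URLClassLoader loader = new URLClassLoader(urls, parent);
        return Class.forName(outputFormatClassName, true, loader);
    }
}
{code}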



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-08 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886530#action_12886530
 ] 

Aniket Mokashi commented on PIG-928:


I have uploaded a wiki page describing the usage and syntax: 
http://wiki.apache.org/pig/UDFsUsingScriptingLanguages

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, 
 RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, 
 RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, 
 RegisterPythonUDFFinale3.patch, RegisterScriptUDFDefineParse.patch, 
 scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-07-08 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Attachment: (was: PIG-1483.patch)

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-07-08 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886537#action_12886537
 ] 

Richard Ding commented on PIG-1483:
---


Add these additional entries to the first map:

{code}
PIG_JOB_FEATURE, PIG_JOB_ALIAS, PIG_JOB_PARENTS
{code}

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-07-08 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Attachment: PIG-1483.patch

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch, PIG-1483.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start)) / 1000;
 dump d;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1484) BinStorage should support comma-separated path

2010-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886538#action_12886538
 ] 

Hadoop QA commented on PIG-1484:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448988/PIG-1484-2.patch
  against trunk revision 960062.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/363/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/363/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/363/console

This message is automatically generated.

 BinStorage should support comma-separated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-08 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1472:
---

Status: Open  (was: Patch Available)

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of Pig queries, most of the execution time is spent 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if the PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
 PigMix v1) that transmit records with bags and maps across map or reduce 
 boundaries run a lot longer (a runtime increase of a few times has been 
 seen).
 A few optimizations have been shown to improve the performance of sedes in my 
 tests:
 1. Use a smaller number of bytes to store the length of a column. For example, 
 if a bytearray is shorter than 255 bytes, a single byte can be used to store 
 the length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF. This reduces the cost of serialization by more than half. 
 Zebra and BinStorage are known to use the DefaultTuple sedes functionality. 
 The serialization format that these loaders use cannot change, so after the 
 optimization their format will differ from the format used between M/R 
 boundaries.
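 A minimal Java sketch of optimization 1 (an illustration only, not the actual 
 patch; the class and method names are hypothetical): the length is written as 
 a one-byte marker plus a single length byte when the column is shorter than 
 255 bytes, and as a marker plus a 4-byte int otherwise.
 {code}
 import java.io.*;

 public class VarLengthSedes {
     private static final int SMALL = 0;   // length fits in one unsigned byte
     private static final int LARGE = 1;   // length stored as a 4-byte int

     // 2 bytes on the wire for small columns instead of the current 4-byte int.
     public static void writeLength(DataOutput out, int len) throws IOException {
         if (len < 255) {
             out.writeByte(SMALL);
             out.writeByte(len);
         } else {
             out.writeByte(LARGE);
             out.writeInt(len);
         }
     }

     public static int readLength(DataInput in) throws IOException {
         return (in.readByte() == SMALL) ? in.readUnsignedByte() : in.readInt();
     }
 }
 {code}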

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-928) UDFs in scripting languages

2010-07-08 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-928:
---

Status: Open  (was: Patch Available)

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, 
 RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, 
 RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, 
 RegisterPythonUDFFinale3.patch, RegisterScriptUDFDefineParse.patch, 
 scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as Python, 
 Ruby, etc. This frees users from needing to compile Java, generate a jar, 
 etc. It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-08 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1472:
---

Attachment: PIG-1472.3.patch

Patch with fixes for the javac, javadoc, and findbugs warnings. The tests that 
were reported as failing pass when I run them on my machine; the failures seem 
to have been caused by problems in the Hudson environment.


 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of Pig queries, most of the execution time is spent 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if the PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
 PigMix v1) that transmit records with bags and maps across map or reduce 
 boundaries run a lot longer (a runtime increase of a few times has been 
 seen).
 A few optimizations have been shown to improve the performance of sedes in my 
 tests:
 1. Use a smaller number of bytes to store the length of a column. For example, 
 if a bytearray is shorter than 255 bytes, a single byte can be used to store 
 the length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF. This reduces the cost of serialization by more than half. 
 Zebra and BinStorage are known to use the DefaultTuple sedes functionality. 
 The serialization format that these loaders use cannot change, so after the 
 optimization their format will differ from the format used between M/R 
 boundaries.
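 A short sketch of optimization 2, again with hypothetical class and method 
 names: replacing hand-rolled per-character String sedes with the JDK's 
 modified-UTF-8 methods DataOutput.writeUTF and DataInput.readUTF.
 {code}
 import java.io.*;

 public class StringSedes {
     // Hand-rolled style (illustrative): 4-byte length, then 2 bytes per char.
     public static void writeCharByChar(DataOutput out, String s) throws IOException {
         out.writeInt(s.length());
         out.writeChars(s);
     }

     public static String readCharByChar(DataInput in) throws IOException {
         int len = in.readInt();
         StringBuilder sb = new StringBuilder(len);
         for (int i = 0; i < len; i++) {
             sb.append(in.readChar());
         }
         return sb.toString();
     }

     // Optimized style: modified UTF-8, one byte per ASCII character.
     // Note: writeUTF only handles strings whose encoded form fits in 65535 bytes.
     public static void writeUtf(DataOutput out, String s) throws IOException {
         out.writeUTF(s);
     }

     public static String readUtf(DataInput in) throws IOException {
         return in.readUTF();
     }
 }
 {code}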

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-08 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1472:
---

Status: Patch Available  (was: Open)

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of Pig queries, most of the execution time is spent 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if the PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
 PigMix v1) that transmit records with bags and maps across map or reduce 
 boundaries run a lot longer (a runtime increase of a few times has been 
 seen).
 A few optimizations have been shown to improve the performance of sedes in my 
 tests:
 1. Use a smaller number of bytes to store the length of a column. For example, 
 if a bytearray is shorter than 255 bytes, a single byte can be used to store 
 the length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF. This reduces the cost of serialization by more than half. 
 Zebra and BinStorage are known to use the DefaultTuple sedes functionality. 
 The serialization format that these loaders use cannot change, so after the 
 optimization their format will differ from the format used between M/R 
 boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1484) BinStorage should support comma-separated path

2010-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886591#action_12886591
 ] 

Hadoop QA commented on PIG-1484:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12449001/PIG-1484-3.patch
  against trunk revision 960062.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/342/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/342/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/342/console

This message is automatically generated.

 BinStorage should support comma-separated path
 --

 Key: PIG-1484
 URL: https://issues.apache.org/jira/browse/PIG-1484
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch


 BinStorage does not take a comma-separated path. The following script fails:
 a = load '1.bin,2.bin' using BinStorage();
 dump a;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886610#action_12886610
 ] 

Hadoop QA commented on PIG-928:
---

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12449018/RegisterPythonUDF_Final.patch
  against trunk revision 960062.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 146 javac compiler warnings (more 
than the trunk's current 145 warnings).

-1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/364/console

This message is automatically generated.

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, 
 RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, 
 RegisterPythonUDF_Final.patch, RegisterPythonUDFFinale.patch, 
 RegisterPythonUDFFinale3.patch, RegisterScriptUDFDefineParse.patch, 
 scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as Python, 
 Ruby, etc. This frees users from needing to compile Java, generate a jar, 
 etc. It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.