date:20100723


[ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891685#action_12891685
 ] 

Richard Ding commented on PIG-1505:
---


You can take a look at the test cases in TestPigRunner where local Pig scripts 
are passed to the PigRunner.run method. 

You can first copy a local Pig script to the mini-cluster using

{code}
Util.copyFromLocalToCluster(cluster, localScriptFileName, 
scriptFileNameOnCluster);
{code}

and then invoke the run method with the script's location as the argument:

{code}
String[] args = { "-f", "hdfs://scriptFileNameOnCluster" };
PigRunner.run(args, null);
{code}

 support jars and scripts in dfs
 ---

 Key: PIG-1505
 URL: https://issues.apache.org/jira/browse/PIG-1505
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
Assignee: Andrew Hitchcock
 Attachments: pig-jars-and-scripts-from-dfs-3.patch, 
 pig-jars-and-scripts-from-dfs-trunk-1.patch, 
 pig-jars-and-scripts-from-dfs-trunk-2.patch, 
 pig-jars-and-scripts-from-dfs-trunk.patch


 Pig can't operate on files stored in Amazon S3.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1435) make sure dependent jobs fail when a job in multiquery fails


 [ 
https://issues.apache.org/jira/browse/PIG-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1435:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to trunk. Thanks Niraj.

 make sure dependent jobs fail when a job in multiquery fails
 

 Key: PIG-1435
 URL: https://issues.apache.org/jira/browse/PIG-1435
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: depJobs.patch, depJobsFailure.patch, 
 depJobsFailure2.patch, depJobsFailure3.patch


 Currently, if one of the MQ jobs fails, Pig tries to run all remaining jobs.
 As a result, if data was partially generated by the failed job, you might get
 incorrect results from dependent jobs.
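
The intended behavior can be illustrated with a small, self-contained sketch. This is not the committed patch: the class name DependentJobPruner, the method jobsToSkip, and the job names are all hypothetical, and the sketch assumes the multiquery plan is available as a simple map from each job to the jobs it reads from.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not Pig's actual implementation.
class DependentJobPruner {

    // dependsOn maps each job to the jobs whose output it reads.
    // Returns the failed job plus every job that depends on it,
    // directly or transitively.
    static Set<String> jobsToSkip(Map<String, List<String>> dependsOn, String failedJob) {
        // Invert the edges: producer -> jobs that consume its output.
        Map<String, List<String>> consumers = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
            for (String producer : entry.getValue()) {
                consumers.computeIfAbsent(producer, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        // Walk the consumer graph starting from the failed job.
        Set<String> skip = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(failedJob);
        while (!pending.isEmpty()) {
            String job = pending.pop();
            if (skip.add(job)) {
                for (String consumer : consumers.getOrDefault(job, Collections.<String>emptyList())) {
                    pending.push(consumer);
                }
            }
        }
        return skip;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new HashMap<>();
        dependsOn.put("job2", Arrays.asList("job1"));
        dependsOn.put("job3", Arrays.asList("job2"));
        dependsOn.put("job4", Collections.<String>emptyList());
        // job4 does not read from job1, so it is not in the skip set.
        System.out.println(jobsToSkip(dependsOn, "job1"));
    }
}
```

Jobs outside the returned set do not read, even indirectly, from the failed job, so running them cannot pick up partially generated data.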


[jira] Assigned: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce


 [ 
https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair reassigned PIG-1516:
--

Assignee: Thejas M Nair

 finalize in bag implementations causes pig to run out of memory in reduce 
 --

 Key: PIG-1516
 URL: https://issues.apache.org/jira/browse/PIG-1516
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 *Problem:*
 Pig bag implementations that are subclasses of DefaultAbstractBag have
 finalize methods. As a result, the garbage collector moves them to a
 finalization queue, and the memory they use is freed only after
 finalization has run.
 If the bags are not finalized fast enough, the finalization queue consumes
 a lot of memory and Pig runs out of memory. This can happen when a large
 number of small bags are being created.
 *Solution:*
 The finalize method exists to delete the spill files that are created when
 a bag is too large. If the bags are small enough, no spill files are
 created and the finalize method serves no purpose.
 A new class that holds a list of files (FileList) will be introduced. This
 class will have a finalize method that deletes the files. The bags will no
 longer have finalize methods, and they will use FileList instead of
 ArrayList<File>.
 *Possible workaround for earlier releases:*
 Since the fix is going into 0.8, here is a workaround: disabling the
 combiner reduces the number of bags created, because the stage that
 combines intermediate merge results is skipped. I would recommend disabling
 it only if you have this problem, as it is likely to slow down the query.
 To disable the combiner, set the property: -Dpig.exec.nocombiner=true
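
A minimal sketch of the proposed design follows, assuming only what the description states: a FileList class whose finalize method deletes the files, and bags without finalize methods. The method names (add, size, deleteAll), the SketchBag class, and the lazy creation of the FileList are illustrative assumptions, not the actual patch. Note also that finalize is deprecated in current Java releases; it is used here only to mirror the description.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed FileList; the real class may differ.
class FileList {
    private final List<File> files = new ArrayList<>();

    void add(File file) { files.add(file); }

    int size() { return files.size(); }

    // Deletes every tracked file and returns how many were removed.
    int deleteAll() {
        int deleted = 0;
        for (File file : files) {
            if (file.delete()) deleted++;
        }
        files.clear();
        return deleted;
    }

    // Only this small holder is finalizable, so only bags that actually
    // spilled put an object on the finalization queue.
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() {
        deleteAll();
    }
}

// Illustrative bag with no finalize method of its own.
class SketchBag {
    private final List<String> tuples = new ArrayList<>();
    private FileList spillFiles; // stays null for bags that never spill

    void add(String tuple) { tuples.add(tuple); }

    // Writes the in-memory tuples to a temp file and tracks it in the FileList.
    void spill() {
        try {
            File file = File.createTempFile("sketchbag", ".spill");
            Files.write(file.toPath(), tuples);
            if (spillFiles == null) spillFiles = new FileList();
            spillFiles.add(file);
            tuples.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    FileList spillFiles() { return spillFiles; }
}
```

In this sketch a small bag never allocates a FileList, so it can be reclaimed by the garbage collector without passing through the finalization queue.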


[jira] Created: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

finalize in bag implementations causes pig to run out of memory in reduce 
--

 Key: PIG-1516
 URL: https://issues.apache.org/jira/browse/PIG-1516
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Thejas M Nair
 Fix For: 0.8.0


*Problem:*
Pig bag implementations that are subclasses of DefaultAbstractBag have
finalize methods. As a result, the garbage collector moves them to a
finalization queue, and the memory they use is freed only after
finalization has run.
If the bags are not finalized fast enough, the finalization queue consumes
a lot of memory and Pig runs out of memory. This can happen when a large
number of small bags are being created.

*Solution:*
The finalize method exists to delete the spill files that are created when
a bag is too large. If the bags are small enough, no spill files are
created and the finalize method serves no purpose.
A new class that holds a list of files (FileList) will be introduced. This
class will have a finalize method that deletes the files. The bags will no
longer have finalize methods, and they will use FileList instead of
ArrayList<File>.

*Possible workaround for earlier releases:*
Since the fix is going into 0.8, here is a workaround: disabling the
combiner reduces the number of bags created, because the stage that
combines intermediate merge results is skipped. I would recommend disabling
it only if you have this problem, as it is likely to slow down the query.
To disable the combiner, set the property: -Dpig.exec.nocombiner=true



Pig 0.8.0 branch plan

2010-07-23 Thread Olga Natkovich

Pig Developers,

I would like to propose that we branch for Pig 0.8.0 at the end of
August and plan for the release by the end of October. Please let me
know if you see a problem with either of the dates.

If you are planning to contribute any patches to Pig 0.8.0, please make
sure that you have a JIRA open and linked to the 0.8.0 release, and that
you will be able to get the code in before the branch is created. If you
have a JIRA assigned to you that is linked to Pig 0.8.0 and you don't
think you can get it in before the branch, please unlink it from the
release.

Thanks,

Olga

[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce


[ 
https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891784#action_12891784
 ] 

Thejas M Nair commented on PIG-1516:


Regarding the workaround - I would recommend disabling the combiner only if 
other steps such as increasing the heap size or increasing the number of 
reducers do not help.

 finalize in bag implementations causes pig to run out of memory in reduce 
 --

 Key: PIG-1516
 URL: https://issues.apache.org/jira/browse/PIG-1516
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0



[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword


[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891788#action_12891788
 ] 

Olga Natkovich commented on PIG-1249:
-

Ashutosh,

First, the changes are not going to be in the framework until Hadoop 22, and I
don't think we want to wait that long, as we are seeing quite a few problems on
our cluster. Second, I think we want to take Pig in the direction of setting
things up for users. Of course, we don't have the stats right now to do so
accurately, but I think this is a step in the right direction.

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safeguards against naive scripts
 that process a *lot* of data without using the PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge
 data sets (10TB) with a badly misconfigured number of reduces, e.g. 1 reduce.
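
The issue text does not spell out the heuristic, so the following is only one plausible shape for such a safeguard: derive the reducer count from the input size when PARALLEL is absent. The class ReducerEstimator, the method estimateReducers, and the values 1 GB per reducer and a cap of 999 are assumptions chosen for the example.

```java
// Hypothetical heuristic for illustration; names and values are assumptions.
class ReducerEstimator {

    // Returns ceil(inputBytes / bytesPerReducer), clamped to [1, maxReducers].
    static int estimateReducers(long inputBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("bytesPerReducer and maxReducers must be positive");
        }
        long size = Math.max(0L, inputBytes);
        // Ceiling division written to avoid overflow on very large inputs.
        long needed = size / bytesPerReducer + (size % bytesPerReducer == 0 ? 0 : 1);
        // At least one reducer, at most maxReducers.
        return (int) Math.min(Math.max(1L, needed), (long) maxReducers);
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        long tenTb = 10L << 40;
        int reducers = estimateReducers(tenTb, oneGb, 999);
        // Log the chosen value so the user knows what will be used.
        System.out.println("Using " + reducers + " reducers for " + tenTb + " input bytes");
    }
}
```

With these example values, a 10TB input is assigned the capped maximum rather than a single reducer.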


[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword


[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891789#action_12891789
 ] 

Olga Natkovich commented on PIG-1249:
-

Jeff, sorry this patch has not received much attention for a while. Can I ask
you to do the following:

(1) Regenerate the patch for the latest trunk and make sure that the tests are
passing and we get no additional warnings.
(2) Add a docs comment that describes in one place what the exact heuristics
are, when they are applied, and how they can be influenced. I will ask our doc
writer to incorporate this information into the Pig 0.8.0 documentation.
(3) If it is not already done, can we log the value that will be used so that
the user knows what is happening?

Thanks!

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch



[jira] Updated: (PIG-259) allow store to overwrite existing directory


 [ 
https://issues.apache.org/jira/browse/PIG-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-259:
---

Fix Version/s: (was: 0.8.0)

Unlinking since there has been no activity since early May. Jeff, please feel
free to link it back in if you are still planning to work on it for the 0.8
release.

 allow store to overwrite existing directory
 ---

 Key: PIG-259
 URL: https://issues.apache.org/jira/browse/PIG-259
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Jeff Zhang
 Attachments: Pig_259.patch, Pig_259_2.patch, Pig_259_3.patch, 
 Pig_259_4.patch


 We have users who are asking for a flag to overwrite an existing directory.


[jira] Resolved: (PIG-466) PERFORMANCE: dropping the columns as soon as possible


 [ 
https://issues.apache.org/jira/browse/PIG-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-466.


Resolution: Fixed

This is already resolved as part of PIG-1178

 PERFORMANCE: dropping the columns as soon as possible
 -

 Key: PIG-466
 URL: https://issues.apache.org/jira/browse/PIG-466
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.8.0


 Currently, each operator carries all the data until foreach is encountered. 
 This can cause significant performance degradation.


[jira] Assigned: (PIG-498) Pig does not error out while trying to use an input file to which the user does not have access permissions


 [ 
https://issues.apache.org/jira/browse/PIG-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-498:
--

Assignee: niraj rai

I am guessing this issue might have gone away with Pig 0.7.0. Niraj, could you
verify and, if it is gone, please close the issue?

 Pig does not error out while trying to use an input file to which the user
 does not have access permissions
 --

 Key: PIG-498
 URL: https://issues.apache.org/jira/browse/PIG-498
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: niraj rai
 Fix For: 0.8.0


 Session illustrating the issue.
 {code}
 bash-3.00$ hadoop fs -ls /data/statistics.txt
 ls: org.apache.hadoop.fs.permission.AccessControlException: Permission 
 denied: user=username, access=READ_EXECUTE, inode=inodepermissions-
 bash-3.00$ pig -latest 
 2008-10-16 23:31:25,134 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to HOD...
 ...
 2008-10-16 23:34:45,810 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: local
 grunt> a = load '/data/statistics.txt';
 grunt> dump a;
 2008-10-16 23:39:05,624 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 100% complete
 2008-10-16 23:39:05,624 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Success!
 grunt>
 {code}


[jira] Assigned: (PIG-348) -j command line option doesn't work


 [ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-348:


Assignee: Richard Ding  (was: Corinne Chandel)

 -j command line option doesn't work
 ---

 Key: PIG-348
 URL: https://issues.apache.org/jira/browse/PIG-348
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Amir Youssefi
Assignee: Richard Ding
 Fix For: 0.8.0


 According to:
 $ pig --help 
 ...
 -j, -jar jarfile load jarfile
 ...
 yet 
 $pig -j my.jar
 doesn't work in place of:
 register my.jar 
 in Pig script. 


[jira] Commented: (PIG-348) -j command line option doesn't work


[ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891795#action_12891795
 ] 

Richard Ding commented on PIG-348:
--

I'll first remove the -j option from the source code.

 -j command line option doesn't work
 ---

 Key: PIG-348
 URL: https://issues.apache.org/jira/browse/PIG-348
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Amir Youssefi
Assignee: Corinne Chandel
 Fix For: 0.8.0



[jira] Updated: (PIG-1453) [zebra] Intermittent failure for TestOrderPreserveUnionHDFS

2010-07-23 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1453:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to the trunk.

 [zebra] Intermittent failure for TestOrderPreserveUnionHDFS
 ---

 Key: PIG-1453
 URL: https://issues.apache.org/jira/browse/PIG-1453
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1453.patch, PIG-1453.patch





[jira] Resolved: (PIG-602) Pass global configurations to UDF


 [ 
https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-602.


Resolution: Fixed

 Pass global configurations to UDF
 -

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han
 Fix For: 0.8.0


 We are seeking an easy way to pass a large number of global configurations to 
 UDFs.
 Since our application contains many pig jobs, and has a large number of 
 configurations. Passing configurations through command line is not an ideal 
 way (i.e. modifying single parameter needs to change multiple command lines). 
 And to put everything into the hadoop conf is not an ideal way either.
 We would like to see if Pig can provide such a facility that allows us to 
 pass a configuration file in some format(XML?) and then make it available 
 through out all the UDFs.


[jira] Commented: (PIG-602) Pass global configurations to UDF


[ 
https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891800#action_12891800
 ] 

Olga Natkovich commented on PIG-602:


This work is already done. The user can propagate the properties via
-propertyfile filename on the command line and then retrieve them via a call
to UDFContext.getJobConf. We just need to document this for the Pig 0.8.0
release.
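
As a rough illustration of the property-file side of this, the sketch below parses key=value text with java.util.Properties and reads typed values with defaults. It deliberately does not call Pig: the UdfSettings class, its methods, and the myapp.* keys are hypothetical, and in a real UDF the values would be retrieved through UDFContext as described above rather than parsed directly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical wrapper for illustration; not part of Pig.
class UdfSettings {
    private final Properties props;

    UdfSettings(Properties props) { this.props = props; }

    // Parses text in the standard java.util.Properties key=value format.
    static UdfSettings parse(String text) {
        Properties loaded = new Properties();
        try {
            loaded.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new UdfSettings(loaded);
    }

    // Returns the value for key, or fallback when the key is absent.
    String get(String key, String fallback) {
        return props.getProperty(key, fallback);
    }

    // Returns the value for key parsed as an int, or fallback when absent.
    int getInt(String key, int fallback) {
        String value = props.getProperty(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        UdfSettings settings = parse("myapp.threshold=42\nmyapp.region=us\n");
        System.out.println(settings.getInt("myapp.threshold", 0));
        System.out.println(settings.get("myapp.missing", "default"));
    }
}
```

Keeping all settings in one file means that changing a single parameter does not require editing multiple command lines, which is the concern raised in the issue.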

 Pass global configurations to UDF
 -

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han
 Fix For: 0.8.0



[jira] Updated: (PIG-348) -j command line option doesn't work


 [ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-348:
-

Attachment: PIG-348.path

 -j command line option doesn't work
 ---

 Key: PIG-348
 URL: https://issues.apache.org/jira/browse/PIG-348
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Amir Youssefi
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-348.path



[jira] Updated: (PIG-348) -j command line option doesn't work


 [ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-348:
-

Status: Patch Available  (was: Open)

 -j command line option doesn't work
 ---

 Key: PIG-348
 URL: https://issues.apache.org/jira/browse/PIG-348
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Amir Youssefi
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-348.path



[jira] Updated: (PIG-1379) Jars registered from command line should override the ones present in the script


 [ 
https://issues.apache.org/jira/browse/PIG-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1379:
--

Status: Open  (was: Patch Available)

 Jars registered from command line should override the ones present in the 
 script 
 -

 Key: PIG-1379
 URL: https://issues.apache.org/jira/browse/PIG-1379
 Project: Pig
  Issue Type: Improvement
Reporter: Ankur
Assignee: Richard Ding
 Fix For: 0.8.0


 Jars that are registered from the command line when executing the pig script 
 should override the ones that are specified via 'register' in the pig script 
 itself.


[jira] Updated: (PIG-1379) Jars registered from command line should override the ones present in the script


 [ 
https://issues.apache.org/jira/browse/PIG-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1379:
--

Attachment: (was: PIG-1379.patch)

 Jars registered from command line should override the ones present in the 
 script 
 -

 Key: PIG-1379
 URL: https://issues.apache.org/jira/browse/PIG-1379
 Project: Pig
  Issue Type: Improvement
Reporter: Ankur
Assignee: Richard Ding
 Fix For: 0.8.0



[jira] Updated: (PIG-1379) Jars registered from command line should override the ones present in the script


 [ 
https://issues.apache.org/jira/browse/PIG-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1379:
--

Attachment: (was: PIG-1379.patch)

 Jars registered from command line should override the ones present in the 
 script 
 -

 Key: PIG-1379
 URL: https://issues.apache.org/jira/browse/PIG-1379
 Project: Pig
  Issue Type: Improvement
Reporter: Ankur
Assignee: Richard Ding
 Fix For: 0.8.0



[jira] Commented: (PIG-1379) Jars registered from command line should override the ones present in the script


[ 
https://issues.apache.org/jira/browse/PIG-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891815#action_12891815
 ] 

Richard Ding commented on PIG-1379:
---

Alan, I got your point. I now think that we should reconsider this feature 
request. It isn't clear to me why this is useful. Users can use parameter 
substitution if they don't want to change the Pig scripts. 

I moved the posted patch to PIG-348. 

 Jars registered from command line should override the ones present in the 
 script 
 -

 Key: PIG-1379
 URL: https://issues.apache.org/jira/browse/PIG-1379
 Project: Pig
  Issue Type: Improvement
Reporter: Ankur
Assignee: Richard Ding
 Fix For: 0.8.0



[jira] Resolved: (PIG-1379) Jars registered from command line should override the ones present in the script


 [ 
https://issues.apache.org/jira/browse/PIG-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1379.
-

Resolution: Won't Fix

This is a non-backward-compatible change, and it is not clear why we need to
make it. Parameter substitution can be used to drive execution from the command
line.

 Jars registered from command line should override the ones present in the 
 script 
 -

 Key: PIG-1379
 URL: https://issues.apache.org/jira/browse/PIG-1379
 Project: Pig
  Issue Type: Improvement
Reporter: Ankur
Assignee: Richard Ding
 Fix For: 0.8.0



[jira] Commented: (PIG-348) -j command line option doesn't work


[ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891820#action_12891820
 ] 

Olga Natkovich commented on PIG-348:


+1, changes look good

 -j command line option doesn't work
 ---

 Key: PIG-348
 URL: https://issues.apache.org/jira/browse/PIG-348
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Amir Youssefi
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-348.path



[jira] Updated: (PIG-621) Casts swallow exceptions when there are issues with conversion of bytes to Pig types


 [ 
https://issues.apache.org/jira/browse/PIG-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-621:
---

Fix Version/s: 0.9.0
   (was: 0.8.0)

0.9 is all about improved error handling

 Casts swallow exceptions when there are issues with conversion of bytes to 
 Pig types
 

 Key: PIG-621
 URL: https://issues.apache.org/jira/browse/PIG-621
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
 Fix For: 0.9.0


 In the current implementation of casts, exceptions thrown while converting 
 bytes to Pig types are swallowed. Pig needs to either return NULL or rethrow 
 the exception.
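
The two alternatives named above (return NULL, or rethrow) can be contrasted in a short sketch. This is not Pig's cast code: the class CastSketch and both method names are hypothetical, and the sketch assumes the bytes hold a UTF-8 decimal string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the two policies; not Pig's actual cast code.
class CastSketch {

    // Policy 1: a failed conversion yields null instead of an exception.
    static Integer bytesToIntegerOrNull(byte[] bytes) {
        try {
            return Integer.valueOf(new String(bytes, StandardCharsets.UTF_8).trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Policy 2: a failed conversion is rethrown with the offending value
    // attached, so the caller can see what went wrong.
    static Integer bytesToIntegerStrict(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Cannot cast '" + text + "' to int", e);
        }
    }

    public static void main(String[] args) {
        byte[] good = "12".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytesToIntegerOrNull(good));
        System.out.println(bytesToIntegerOrNull(bad));
    }
}
```

Either policy makes the failure visible to the caller, either as an explicit null or as an exception.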


[jira] Resolved: (PIG-729) Use of default parallelism


 [ 
https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-729.


Resolution: Duplicate

We are going with the approach outlined in PIG-1249.

 Use of default parallelism
 --

 Key: PIG-729
 URL: https://issues.apache.org/jira/browse/PIG-729
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
 Fix For: 0.8.0


 Currently, if the user does not specify the number of reduce slots using the 
 parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
 This model worked well with dynamically allocated clusters using HOD and for 
 static clusters where the default number of reduce slots was explicitly set. 
 With Hadoop 0.20, a single static cluster will be shared amongst a number of 
 queues. As a result, a common scenario is to end up with default number of 
 reducers set to one (1).
 When users migrate to Hadoop 0.20, they might see a dramatic change in the 
 performance of their queries if they had not used the parallel keyword to 
 specify the number of reducers. In order to mitigate such circumstances, Pig 
 can support one of the following:
 1. Specify a default parallelism for the entire script.
 This option will allow users to use the same parallelism for all operators 
 that do not have the explicit parallel keyword. This will ensure that the 
 scripts utilize more reducers than the default of one reducer. On the down 
 side, due to data transformations, usually operations that are performed 
 towards the end of the script will need smaller number of reducers compared 
 to the operators that appear at the beginning of the script.
 2. Display a warning message for each reduce-side operator that does not use
 the explicit parallel keyword. Proceed with the execution.
 3. Display an error message indicating the operator that does not have the 
 explicit use of the parallel keyword. Stop the execution.
 Other suggestions/thoughts/solutions are welcome.
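
The three options can be compared side by side in a small sketch. Everything here is hypothetical: the ParallelismResolver class, the Policy names, and the HADOOP_DEFAULT marker are invented for illustration and do not describe what Pig implements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three options; not Pig's actual behavior.
class ParallelismResolver {
    enum Policy { SCRIPT_DEFAULT, WARN, ERROR }

    // Marker meaning "leave the reducer count to Hadoop".
    static final int HADOOP_DEFAULT = -1;

    final List<String> warnings = new ArrayList<>();
    private final Policy policy;
    private final int scriptDefault;

    ParallelismResolver(Policy policy, int scriptDefault) {
        this.policy = policy;
        this.scriptDefault = scriptDefault;
    }

    // explicitParallel <= 0 means the operator has no PARALLEL clause.
    int resolve(String operator, int explicitParallel) {
        if (explicitParallel > 0) {
            return explicitParallel;
        }
        if (policy == Policy.SCRIPT_DEFAULT) {
            // Option 1: fall back to a script-wide default.
            return scriptDefault;
        }
        if (policy == Policy.WARN) {
            // Option 2: record a warning and proceed.
            warnings.add("Operator " + operator + " has no PARALLEL clause");
            return HADOOP_DEFAULT;
        }
        // Option 3: stop the execution.
        throw new IllegalStateException("Operator " + operator + " has no PARALLEL clause");
    }

    public static void main(String[] args) {
        ParallelismResolver resolver = new ParallelismResolver(Policy.WARN, 10);
        System.out.println(resolver.resolve("GROUP", 0));
        System.out.println(resolver.warnings);
    }
}
```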


[jira] Commented: (PIG-348) -j command line option doesn't work


[ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891827#action_12891827
 ] 

Richard Ding commented on PIG-348:
--


test-patch results:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
{code}

 -j command line option doesn't work
 ---

 Key: PIG-348
 URL: https://issues.apache.org/jira/browse/PIG-348
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Amir Youssefi
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-348.path


 According to:
 $ pig --help 
 ...
 -j, -jar jarfile load jarfile
 ...
 yet 
 $ pig -j my.jar
 doesn't work in place of:
 register my.jar 
 in Pig script. 


[jira] Resolved: (PIG-787) Allow UDFs and their dependencies to be distributed via Hadoop's distributed cache


 [ 
https://issues.apache.org/jira/browse/PIG-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-787.


Resolution: Won't Fix

It does not look like there is a reason to do this.

 Allow UDFs and their dependencies to be distributed via Hadoop's distributed 
 cache
 --

 Key: PIG-787
 URL: https://issues.apache.org/jira/browse/PIG-787
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0





[jira] Assigned: (PIG-873) Optimizer should allow search for global patterns

[
https://issues.apache.org/jira/browse/PIG-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Olga Natkovich reassigned PIG-873:
--

Assignee: Daniel Dai

Daniel, please review with Santhosh whether additional work is required. If
not, please close. If there is more work, let's discuss whether we need to do
this in Pig 0.8.0. Thanks

Optimizer should allow search for global patterns
-

Key: PIG-873
URL: https://issues.apache.org/jira/browse/PIG-873
Project: Pig
Issue Type: Improvement
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
Assignee: Daniel Dai
Fix For: 0.8.0

Currently, the optimizer works on the following mechanism:
1. Specify the pattern to be searched
2. For each occurrence of the pattern, check and then apply a transformation
With this approach, the search for a pattern is localized. An example will
illustrate the problem.
If the pattern to be searched for is foreach (with flatten) connected to any
operator and if the graph has more than one foreach (with flatten) connected
to an operator (cross, join, union, etc), then each instance of foreach
connected to the operator is returned as a match. While this is fine for a
localized view (per match), at a global view the pattern to be searched for
is any number of foreach connected to an operator.
The implication of not having a globalized view is more rules. There will be
one rule for one foreach connected to an operator, one rule for two foreachs
connected to an operator, etc.
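
The difference between the localized and the global view can be sketched in a few lines of Python. This is only an illustration, not Pig's optimizer: the edge list, the operator names, and the helpers local_matches and global_matches are all hypothetical.

```python
from collections import defaultdict

# Hypothetical plan graph as (source operator, destination operator) edges.
# The names are illustrative only; they are not Pig's internal classes.
EDGES = [
    ("foreach_1", "join_1"),
    ("foreach_2", "join_1"),
    ("foreach_3", "union_1"),
    ("load_1", "union_1"),
]


def is_foreach(name):
    """Assumption: a foreach (with flatten) is recognizable by its name."""
    return name.startswith("foreach")


def local_matches(edges):
    """Localized view: one match per foreach -> operator edge."""
    return [(src, dst) for src, dst in edges if is_foreach(src)]


def global_matches(edges):
    """Global view: one match per operator, listing every foreach feeding it."""
    grouped = defaultdict(list)
    for src, dst in edges:
        if is_foreach(src):
            grouped[dst].append(src)
    return dict(grouped)
```

With the localized view, join_1 shows up in two separate matches; with the global view it shows up once, together with both of its foreach inputs, so a single rule could handle any number of them.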


[jira] Updated: (PIG-930) merge join should handle compressed bz2 sorted files


 [ 
https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-930:
---

Fix Version/s: (was: 0.8.0)

Unlinking from the release. We have not really seen users ask for this.

 merge join should handle compressed bz2 sorted files
 

 Key: PIG-930
 URL: https://issues.apache.org/jira/browse/PIG-930
 Project: Pig
  Issue Type: Bug
Reporter: Pradeep Kamath

 There are two issues. First, POLoad, which is used to read the right-side 
 input, does not handle bz2 files right now. This needs to be fixed.
 Second, in the index map job we bindTo(startOfBlockOffSet) (this will 
 internally discard the first tuple if offset > 0). Then we do the following:
 {noformat}
 While(tuple survives pipeline) {
   Pos =  getPosition()
   getNext() 
   run the tuple  through pipeline in the right side which could have filter
 }
 Emit(key, pos, filename).
 {noformat}
  
 Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) 
 (we do pos - 1 because bindTo will discard the first tuple for pos > 0). Then 
 we do getNext().
 Now in bz2 compressed files, getPosition() returns a position which is not 
 really accurate. The problem is it could be a position in the middle of a 
 compressed bz2 block. Then when we use that position to bindTo() in the final 
 map job, the code would first hunt for a bz2 block header thus skipping the 
 whole current bz2 block. 
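
The underlying constraint, that a byte offset inside a bz2 stream is not a point from which reading can resume, can be seen with Python's standard bz2 module. This is only an analogy under stated assumptions: it uses Python's decoder rather than Pig's bz2 reader, and the payload and the helper can_decompress_from are made up for the illustration.

```python
import bz2

# Made-up sorted records standing in for the right-side input of the join.
payload = b"".join(b"key%05d\tvalue\n" % i for i in range(1000))
compressed = bz2.compress(payload)


def can_decompress_from(data, offset):
    """Return True if decoding can start at this byte offset of the stream."""
    try:
        bz2.decompress(data[offset:])
        return True
    except (OSError, ValueError):
        # The decoder needs a stream header at the start of its input;
        # an offset in the middle of the compressed data does not have one.
        return False
```

Decoding from offset 0 works, while an offset in the middle of the compressed data is rejected. A reader that wants to start there has to scan forward to the next block boundary, which is the skipping behaviour described above.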


[jira] Resolved: (PIG-932) Required fields projection in Loader: nested fields in bag/tuple, map key lookup more than two levels


 [ 
https://issues.apache.org/jira/browse/PIG-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-932.


Resolution: Duplicate

This is a duplicate of https://issues.apache.org/jira/browse/PIG-1324

 Required fields projection in Loader: nested fields in bag/tuple, map key 
 lookup more than two levels
 -

 Key: PIG-932
 URL: https://issues.apache.org/jira/browse/PIG-932
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


 To leverage the performance features provided by Zebra, Pig should be able to 
 figure out which input fields are actually used in a Pig script, and prune 
 unnecessary inputs. This feature is being implemented in 
 [PIG-922|https://issues.apache.org/jira/browse/PIG-922]. However, there are 
 two limitations currently:
 1. Pruning of nested fields applies only to maps. We do not prune sub-fields 
 inside a bag or tuple.
 2. For maps, we currently go only one level deep. E.g., if the user uses 
 a#'key0'#'key1' in a Pig script, a#'key0' will be requested.
 These two limitations are in line with the current limitations of the Zebra 
 loader. Once the Zebra loader can handle this, we need to work to lift these 
 limitations.


[jira] Updated: (PIG-947) Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.


 [ 
https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-947:
---

Fix Version/s: (was: 0.8.0)

I don't think anybody is signed up for this issue. Please relink it to the 
release if you are interested in working on it, and assign it to yourself.

 Parsing Bags by PigStorage is not handled correctly if whitespace before 
 start of tuple.
 

 Key: PIG-947
 URL: https://issues.apache.org/jira/browse/PIG-947
 Project: Pig
  Issue Type: Bug
  Components: data
 Environment: Pig on Hadoop 18
Reporter: Gandul Azul

 The PigStorage parser for bags does not work correctly when a tuple in a bag 
 is preceded by a space. For example, the following is parsed correctly:
 {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
 while this is not: (Note the space before the second tuple)
 {(-5.243084,3.142401,0.000138,2.071200,0), 
 (-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
 It seems that when the parser encounters the space, it treats the rest of the 
 line as a String. With a schema, this results in a typecast of string to 
 databag, which results in an exception. 
 |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field 
 being converted to type bag, caught ParseException Encountered  STRING   
  at |line 1, column 43.
 |Was expecting:
 |( ...
 | field discarded
 Below is the parser debug output for the parsing of the above error sequence: 
 2.071200,0), ( from above...
 ** FOUND A DOUBLENUMBER MATCH (2.071200) **
   Call:   AtomDatum
 Consumed token: DOUBLENUMBER: 2.071200 at line 1 column 31
   Return: AtomDatum
 Return: Datum
Matched the empty string as STRING token.
 Current character : , (44) at line 1 column 39
No more string literal token matches are possible.
Currently matched the first 1 characters as a , token.
 ** FOUND A , MATCH (,) **
 Consumed token: , at line 1 column 39
 Call:   Datum
Matched the empty string as STRING token.
 Current character : 0 (48) at line 1 column 40
No string literal matches possible.
Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER 
 }
 Current character : 0 (48) at line 1 column 40
Currently matched the first 1 characters as a SIGNEDINTEGER token.
Possible kinds of longer matches : { STRING, SIGNEDINTEGER, 
 DOUBLENUMBER, LONGINTEGER, 
  FLOATNUMBER }
 Current character : ) (41) at line 1 column 41
Currently matched the first 1 characters as a SIGNEDINTEGER token.
Putting back 1 characters into the input stream.
 ** FOUND A SIGNEDINTEGER MATCH (0) **
   Call:   AtomDatum
 Consumed token: SIGNEDINTEGER: 0 at line 1 column 40
   Return: AtomDatum
 Return: Datum
Matched the empty string as STRING token.
 Current character : ) (41) at line 1 column 41
No more string literal token matches are possible.
Currently matched the first 1 characters as a ) token.
 ** FOUND A ) MATCH ()) **
   Return: Tuple
   Consumed token: ) at line 1 column 41
Matched the empty string as STRING token.
 Current character : , (44) at line 1 column 42
No more string literal token matches are possible.
Currently matched the first 1 characters as a , token.
 ** FOUND A , MATCH (,) **
   Consumed token: , at line 1 column 42
Matched the empty string as STRING token.
 Current character :   (32) at line 1 column 43
No string literal matches possible.
Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER 
 }
 Current character :   (32) at line 1 column 43
Currently matched the first 1 characters as a STRING token.
Possible kinds of longer matches : { STRING, SIGNEDINTEGER, 
 DOUBLENUMBER }
 Current character : ( (40) at line 1 column 44
Currently matched the first 1 characters as a STRING token.
Putting back 1 characters into the input stream.
 ** FOUND A STRING MATCH ( ) **
 Return: Bag
   Return: Datum
 Return: Parse
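
For comparison, here is a minimal sketch of bag parsing that tolerates whitespace between tuples. It is written in Python purely for illustration and is not the PigStorage parser; the function name parse_bag, the regular expression, and the assumption that every field is numeric are all hypothetical.

```python
import re

# Matches one flat tuple such as (1.0,2.0,3); nested tuples are not handled.
TUPLE_RE = re.compile(r"\(([^()]*)\)")


def parse_bag(text):
    """Parse '{(a,b), (c,d)}' into a list of tuples of floats.

    Whitespace around tuples and around fields is ignored.
    """
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("not a bag: %r" % text)
    body = text[1:-1]
    # Outside the tuples, only separators (commas and whitespace) may remain.
    leftover = TUPLE_RE.sub("", body)
    if leftover.strip(", \t\n"):
        raise ValueError("unexpected text between tuples: %r" % leftover)
    return [
        tuple(float(field) for field in match.group(1).split(","))
        for match in TUPLE_RE.finditer(body)
    ]
```

With this sketch, the two example bags from the report (with and without the space before the second tuple) parse to the same value.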


[jira] Updated: (PIG-959) Merge Join fails when there is a blocking operator before it in query.


 [ 
https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-959:
---

Fix Version/s: (was: 0.8.0)

We are not seeing any requests for this at this time.

 Merge Join fails when there is a blocking operator before it in query.
 --

 Key: PIG-959
 URL: https://issues.apache.org/jira/browse/PIG-959
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: pig-959.patch


 If there is an order-by, distinct, or any other blocking operator in a query 
 followed by a Merge Join, Pig fails to compile it.


[jira] Updated: (PIG-1489) Pig MapReduceLauncher does not use jars in register statement


 [ 
https://issues.apache.org/jira/browse/PIG-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1489:
--

Attachment: PIG-1489_1.patch

New patch adding the source code of the test jar.

  Pig MapReduceLauncher does not use jars in register statement 
 ---

 Key: PIG-1489
 URL: https://issues.apache.org/jira/browse/PIG-1489
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1489.patch, PIG-1489.patch, PIG-1489_1.patch


 If my Pig StoreFunc has its own OutputFormat class, then the Pig 
 MapReduceLauncher will try to instantiate it before launching the mapreduce 
 job and fail with ClassNotFoundException.
 This happens because the Pig MapReduceLauncher uses its own classloader and 
 ignores the classes in the jars named in the register statement.
 The effect is that the jars have to be not only in the register statement in 
 the script but also on the Pig classpath via the -classpath option. 
 This can be remedied by having the Pig MapReduceLauncher construct a 
 classloader that includes the registered jars and use that to instantiate 
 the OutputFormat class.


[jira] Commented: (PIG-1489) Pig MapReduceLauncher does not use jars in register statement


[ 
https://issues.apache.org/jira/browse/PIG-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891856#action_12891856
 ] 

Thejas M Nair commented on PIG-1489:


+1 
You can commit after verifying that tests and checks are passing.




[jira] Commented: (PIG-1150) VAR() Variance UDF


[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891857#action_12891857
 ] 

Olga Natkovich commented on PIG-1150:
-

Dmitriy, is the patch ready to be committed, or are you planning to submit a 
new one? Thanks

 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.
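
The count / sum / sum-of-squares approach described above can be sketched as three stages that mirror an Algebraic UDF. This is an illustrative Python sketch, not the code in var.patch; the function names initial, intermediate, and final are hypothetical stand-ins for the corresponding stages, and the result is the population variance.

```python
def initial(values):
    """Per-chunk partial result: (count, sum, sum of squares)."""
    values = list(values)
    return (len(values), sum(values), sum(v * v for v in values))


def intermediate(partials):
    """Combine partial results by element-wise addition."""
    count = total = total_sq = 0
    for n, s, ss in partials:
        count += n
        total += s
        total_sq += ss
    return (count, total, total_sq)


def final(partial):
    """Population variance computed as E[x^2] - (E[x])^2."""
    count, total, total_sq = partial
    if count == 0:
        return None
    mean = total / count
    return total_sq / count - mean * mean


# Two chunks, as if they had been processed by different mappers.
chunks = [[1, 2, 3, 4], [5, 6, 7, 8]]
variance = final(intermediate(initial(chunk) for chunk in chunks))
```

Because the partials combine by plain addition, they can be merged in any order and in any grouping. One known caveat: the sum-of-squares formula can lose precision to cancellation when the mean is large relative to the spread of the values.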


[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc


[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891858#action_12891858
 ] 

Olga Natkovich commented on PIG-1205:
-

Jeff and Dmitriy, are you still planning to finish this for the Pig 0.8.0 release?

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch





[jira] Commented: (PIG-1489) Pig MapReduceLauncher does not use jars in register statement