[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791872#action_12791872
 ] 

Hadoop QA commented on PIG-1143:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428266/PIG_1143.patch.1
  against trunk revision 891499.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/console

This message is automatically generated.

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch, PIG_1143.patch.1


 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.
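
One way to picture the requested change: compute the count once on the front end, store it in the job configuration, and let every map task read it back. The following is only a minimal sketch of that idea, not the attached patch. `java.util.Properties` stands in for the Hadoop job configuration, and the key `pig.sample.count` and all method names are hypothetical.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

class SampleCountSharing {
    // Hypothetical configuration key; not a real Pig property name.
    static final String KEY = "pig.sample.count";

    // Counts how often the expensive computation runs (illustration only).
    static final AtomicInteger computations = new AtomicInteger();

    // Stand-in for the expensive step, e.g. sizing a large input directory.
    static long computeSampleCount(long totalInputBytes, long bytesPerSample) {
        computations.incrementAndGet();
        return Math.max(1, totalInputBytes / bytesPerSample);
    }

    // Front end: compute once and publish the value in the job configuration.
    static void publish(Properties jobConf, long totalInputBytes, long bytesPerSample) {
        long count = computeSampleCount(totalInputBytes, bytesPerSample);
        jobConf.setProperty(KEY, Long.toString(count));
    }

    // Map task: read the shared value instead of recomputing it.
    static long readInMapper(Properties jobConf) {
        return Long.parseLong(jobConf.getProperty(KEY));
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        publish(jobConf, 1_000_000L, 10_000L);
        for (int mapper = 0; mapper < 3; mapper++) {
            System.out.println("mapper " + mapper + " uses " + readInMapper(jobConf) + " samples");
        }
        System.out.println("computations: " + computations.get());
    }
}
```

However many mappers read the value, the computation itself runs a single time.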

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1150) VAR() Variance UDF

2009-12-17 Thread Tamir Kamara (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791909#action_12791909
 ] 

Tamir Kamara commented on PIG-1150:
---

This would be very useful to me, so I tested your patch, but I got weird 
results. I believe the problem is in the combine method: it treats the tuple 
as if it contained the original values, but to my understanding it should work 
on the intermediate output and do something like this:


{code}
// Merges intermediate (sum, count, sumOfSquares) tuples into one tuple.
static protected Tuple combine(DataBag values) throws ExecException {
    double sum = 0;
    long count = 0;
    double sumOfSquares = 0;

    Tuple output = mTupleFactory.newTuple(3);

    for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
        Tuple t = it.next();

        // Each tuple already holds partial aggregates, so just add them up.
        sum += (Double) t.get(0);
        count += (Long) t.get(1);
        sumOfSquares += (Double) t.get(2);
    }

    output.set(0, sum);
    output.set(1, count);
    output.set(2, sumOfSquares);

    return output;
}
{code}

 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
 Fix For: 0.7.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.
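
For readers who want to see the count / sum / sum-of-squares scheme end to end, here is a minimal sketch in plain Java. It is not the code from var.patch and does not use Pig's Algebraic interface; the class and method names are hypothetical. It computes the population variance as sumOfSquares/count - mean^2, a formula that can lose precision when the mean is large relative to the spread.

```java
import java.util.Arrays;
import java.util.List;

class VarianceSketch {
    // Partial aggregate of the kind each map or combine step would emit.
    static final class Partial {
        final double sum;
        final long count;
        final double sumOfSquares;

        Partial(double sum, long count, double sumOfSquares) {
            this.sum = sum;
            this.count = count;
            this.sumOfSquares = sumOfSquares;
        }
    }

    // Initial step: turn raw values into one partial aggregate.
    static Partial initial(double... values) {
        double sum = 0;
        double sumOfSquares = 0;
        for (double v : values) {
            sum += v;
            sumOfSquares += v * v;
        }
        return new Partial(sum, values.length, sumOfSquares);
    }

    // Intermediate step: add partial aggregates component-wise.
    static Partial combine(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        double sumOfSquares = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
            sumOfSquares += p.sumOfSquares;
        }
        return new Partial(sum, count, sumOfSquares);
    }

    // Final step: population variance from the merged aggregate.
    static double variance(Partial p) {
        double mean = p.sum / p.count;
        return p.sumOfSquares / p.count - mean * mean;
    }

    public static void main(String[] args) {
        Partial a = initial(2, 4, 4, 4);
        Partial b = initial(5, 5, 7, 9);
        Partial total = combine(Arrays.asList(a, b));
        System.out.println(variance(total)); // prints 4.0
    }
}
```

Taking `Math.sqrt` of the result gives the standard deviation mentioned above.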




[jira] Assigned: (PIG-1157) Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM

2009-12-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1157:
-

Assignee: Richard Ding

 Sucessive replicated joins do not generate Map Reduce plan and fails due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log


 Hi all,
  I have a script which does 2 replicated joins in succession. Please note 
 that the inputs do not exist on the HDFS.
 {code}
 A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
 A1 = FOREACH A GENERATE a;
 B = GROUP A1 BY a;
 C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
 D = JOIN C BY x, B BY group USING replicated;
 E = JOIN A BY a, D by x USING replicated;
 dump E;
 {code}
 2009-12-16 19:12:00,253 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4
 2009-12-16 19:12:00,254 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees.
 2009-12-16 19:12:00,254 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
 2009-12-16 19:12:00,254 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees.
 2009-12-16 19:12:00,254 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread
 Details at logfile: pig_1260990666148.log
 Looking at the log file:
 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error. unable to create new native thread
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:597)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773)
 at org.apache.pig.PigServer.store(PigServer.java:522)
 at org.apache.pig.PigServer.openIterator(PigServer.java:458)
 at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 Looking at the explain output, we find that no Map Reduce plan is 
 generated. 
  Why is the M/R plan not generated?
 Attaching the script and explain output.
 Viraj




[jira] Commented: (PIG-1151) Date Conversion + Arithmetic UDFs

2009-12-17 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792021#action_12792021
 ] 

Dmitriy V. Ryaboy commented on PIG-1151:


Yes, these would be quite useful, please contribute!

Perhaps put them in a new package, piggybank.date


 Date Conversion + Arithmetic UDFs
 -

 Key: PIG-1151
 URL: https://issues.apache.org/jira/browse/PIG-1151
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
Reporter: sam rash
Priority: Minor

 I would like to offer up some very simple date UDFs I have that wrap JodaTime 
 (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and 
 operate on ISO8601 date strings (for piggybank).  Please advise if these are 
 appropriate.
 1. Date arithmetic:
 takes an input string such as 2009-01-01T13:43:33.000Z (or a partial one 
 such as 2009-01-02) and a timespan (as millis or as string shorthand), and 
 returns an ISO8601 string that adjusts the input date by the specified 
 timespan.
 DatePlus(long timeMs); // + or - number works, is the # of millis
 DatePlus(String timespan); // 10m = 10 minutes, 1h = 1 hour, 1172 ms, etc.
 DateMinus(String timespan); // proposed explicit minus when using string shorthand for time periods
 2. Date comparison (when you don't have full strings that you can use string 
 compare with):
 DateIsBefore(String dateString); // true if lhs is before rhs
 DateIsAfter(String dateString); // true if lhs is after rhs
 3. Date truncation functions:
 take partial ISO8601 strings and truncate to:
 toMinute(String dateString);
 toHour(String dateString);
 toDay(String dateString);
 toWeek(String dateString);
 toMonth(String dateString);
 toYear(String dateString);
 If any or all of these are helpful, I'm happy to contribute them to Pig.
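
To make the proposed behaviour concrete, here is a rough sketch of three of these functions. It deliberately swaps in the JDK's `java.time` classes for JodaTime so that it runs without extra libraries, so it is not the reporter's implementation. The class name, the two-argument signatures, and the accepted shorthand units (ms, s, m, h, d) are assumptions, and only full timestamps ending in Z are handled, not partial dates such as 2009-01-02.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateUdfSketch {
    // Shorthand such as "10m", "1h" or "1172 ms"; the unit set is an assumption.
    private static final Pattern SPAN =
            Pattern.compile("\\s*(\\d+)\\s*(ms|s|m|h|d)\\s*");

    static Duration parseTimespan(String span) {
        Matcher m = SPAN.matcher(span);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized timespan: " + span);
        }
        long amount = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "ms": return Duration.ofMillis(amount);
            case "s":  return Duration.ofSeconds(amount);
            case "m":  return Duration.ofMinutes(amount);
            case "h":  return Duration.ofHours(amount);
            default:   return Duration.ofDays(amount); // "d"
        }
    }

    // Date arithmetic: shift an ISO8601 instant forward by a timespan.
    static String datePlus(String isoDate, String timespan) {
        return Instant.parse(isoDate).plus(parseTimespan(timespan)).toString();
    }

    // Date comparison on parsed instants rather than on raw strings.
    static boolean dateIsBefore(String lhs, String rhs) {
        return Instant.parse(lhs).isBefore(Instant.parse(rhs));
    }

    // Truncation: drop everything below the hour.
    static String toHour(String isoDate) {
        return Instant.parse(isoDate).truncatedTo(ChronoUnit.HOURS).toString();
    }

    public static void main(String[] args) {
        System.out.println(datePlus("2009-01-01T13:43:33.000Z", "10m"));
        System.out.println(dateIsBefore("2009-01-01T00:00:00Z", "2009-01-02T00:00:00Z"));
        System.out.println(toHour("2009-01-01T13:43:33.000Z"));
    }
}
```

A real piggybank UDF would wrap each of these in an EvalFunc and would also need to decide how nulls and malformed input are reported.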




[jira] Updated: (PIG-1136) [zebra] Map Split of Storage info do not allow for leading underscore char '_'

2009-12-17 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1136:
-

Status: Patch Available  (was: Open)

 [zebra] Map Split of Storage info do not allow for leading underscore char '_'
 --

 Key: PIG-1136
 URL: https://issues.apache.org/jira/browse/PIG-1136
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Yan Zhou
Priority: Minor

 Some users need map keys with a leading underscore to be supported. Pig 
 column names do not allow a leading underscore, but apparently no such 
 restriction is placed on map keys.




[jira] Updated: (PIG-1136) [zebra] Map Split of Storage info do not allow for leading underscore char '_'

2009-12-17 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1136:
---

Status: Open  (was: Patch Available)

 [zebra] Map Split of Storage info do not allow for leading underscore char '_'
 --

 Key: PIG-1136
 URL: https://issues.apache.org/jira/browse/PIG-1136
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Yan Zhou
Priority: Minor




[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-17 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1144:


Status: Open  (was: Patch Available)

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: set default_parallel 100. I modified MRPrinter.java to print out 
 the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr)
 mStream.println("MapReduce node " + mr.getOperatorKey().toString() + 
 " Parallelism " + mr.getRequestedParallelism());
 ...
 {code}
 When I run an explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword in front of the ORDER BY.
 Viraj
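
The behaviour this report expects can be stated as a small rule: an explicit PARALLEL on an operator wins, otherwise the script-wide default_parallel applies, and only if neither is set does the job fall back to a single reducer. The sketch below only illustrates that rule and is not Pig's code. The class and method names are hypothetical, and representing "not set" as -1 is an assumption.

```java
class ParallelismSketch {
    // Assumed sentinel for "no value requested".
    static final int UNSET = -1;

    // Resolve the reducer count for one job.
    static int effectiveParallelism(int requestedByOperator, int defaultParallel) {
        if (requestedByOperator > 0) {
            return requestedByOperator;   // PARALLEL n on the operator
        }
        if (defaultParallel > 0) {
            return defaultParallel;       // set default_parallel n
        }
        return 1;                         // nothing set: single reducer
    }

    public static void main(String[] args) {
        // Sort job without PARALLEL, script sets default_parallel 100.
        System.out.println(effectiveParallelism(UNSET, 100)); // prints 100
        // Sort job with PARALLEL 10.
        System.out.println(effectiveParallelism(10, 100));    // prints 10
    }
}
```

In these terms, the reported symptom is the sort job behaving as if the second branch were skipped.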




[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-17 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1144:


Status: Patch Available  (was: Open)

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch





[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs

2009-12-17 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-948:
---

Status: Patch Available  (was: Open)

 [Usability] Relating pig script with MR jobs
 

 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Daniel Dai
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, 
 PIG-948-5.patch, PIG-948-6.patch, pig-948.patch


 Currently it's hard to find a way to relate a pig script with a specific MR 
 job. In a loaded cluster with multiple simultaneous job submissions, it's not 
 easy to figure out which specific MR jobs were launched for a given pig 
 script. If Pig can provide this info, it will be useful to debug and monitor 
 the jobs resulting from a pig script.
 At the very least, Pig should be able to provide the user with the following 
 information:
 1) Job id of the launched job.
 2) Complete web url of jobtracker running this job. 




[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs

2009-12-17 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-948:
---

Status: Open  (was: Patch Available)

 [Usability] Relating pig script with MR jobs
 

 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Daniel Dai
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, 
 PIG-948-5.patch, PIG-948-6.patch, pig-948.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Attachment: (was: PIG_1102.patch)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch


 Memory shortage is one of the main performance issues in Pig. Knowing when we 
 spill to the disk is useful for understanding query performance and also to 
 see how certain changes in Pig affect that.
 Other interesting stats to collect would be average CPU usage and max mem 
 usage but I am not sure if this information is easily retrievable.
 Using Hadoop counters for this would make sense.
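
As a rough picture of what per-job spill counting could look like, the sketch below keeps two counters per task and sums them across tasks. It does not use the Hadoop counter API. A plain `EnumMap` of `AtomicLong` values stands in for it, and the counter names SPILL_COUNT and SPILLED_RECORDS as well as the class itself are hypothetical.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class SpillCounterSketch {
    // Hypothetical counter names, not existing Pig or Hadoop counters.
    enum SpillCounter { SPILL_COUNT, SPILLED_RECORDS }

    // One instance models the counters of a single task.
    private final Map<SpillCounter, AtomicLong> counters =
            new EnumMap<>(SpillCounter.class);

    SpillCounterSketch() {
        for (SpillCounter c : SpillCounter.values()) {
            counters.put(c, new AtomicLong());
        }
    }

    // Called each time a bag spills to disk.
    void recordSpill(long recordsSpilled) {
        counters.get(SpillCounter.SPILL_COUNT).incrementAndGet();
        counters.get(SpillCounter.SPILLED_RECORDS).addAndGet(recordsSpilled);
    }

    long get(SpillCounter c) {
        return counters.get(c).get();
    }

    // Job-level view: sum one counter over all tasks.
    static long total(List<SpillCounterSketch> tasks, SpillCounter c) {
        long sum = 0;
        for (SpillCounterSketch task : tasks) {
            sum += task.get(c);
        }
        return sum;
    }

    public static void main(String[] args) {
        SpillCounterSketch task1 = new SpillCounterSketch();
        SpillCounterSketch task2 = new SpillCounterSketch();
        task1.recordSpill(100);
        task1.recordSpill(50);
        task2.recordSpill(10);
        List<SpillCounterSketch> job = Arrays.asList(task1, task2);
        System.out.println(total(job, SpillCounter.SPILL_COUNT));     // prints 3
        System.out.println(total(job, SpillCounter.SPILLED_RECORDS)); // prints 160
    }
}
```

With real Hadoop counters the framework would do the per-job summation itself; the explicit `total` method here only makes that aggregation visible.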




[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Patch Available  (was: Open)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Open  (was: Patch Available)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1157) Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM

2009-12-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1157:
--

Status: Patch Available  (was: Open)

 Sucessive replicated joins do not generate Map Reduce plan and fails due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, 
 replicatedjoinexplain.log





[jira] Updated: (PIG-1157) Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM

2009-12-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1157:
--

Attachment: PIG-1157.patch

The problem is that, by merging a MR splittee with a FR join, the MultiQuery 
optimizer may introduce a directed cycle into the graph of the MR plan. This 
patch fixes the problem by not merging FR splittees.

This is actually stronger than necessary. A better solution would be to check 
whether merging a MR splittee would form a directed cycle in the original DAG 
before merging it, and if not, allow the merge to go ahead.
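
A minimal sketch of such a check, on a generic DAG rather than on Pig's MR plan classes: merging two nodes creates a cycle exactly when one of them can reach the other through at least one intermediate node. The class, the method names, and the node labels in `main` are hypothetical and chosen only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class MergeCycleCheck {
    // Adjacency sets of a directed acyclic graph.
    private final Map<String, Set<String>> successors = new HashMap<>();

    void addEdge(String from, String to) {
        successors.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        successors.computeIfAbsent(to, k -> new HashSet<>());
    }

    // Iterative depth-first search: is target reachable from start?
    private boolean reachable(String start, String target) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (node.equals(target)) {
                return true;
            }
            if (seen.add(node)) {
                for (String next : successors.getOrDefault(node, Collections.emptySet())) {
                    stack.push(next);
                }
            }
        }
        return false;
    }

    // True if a path from 'from' to 'to' exists that is longer than one edge.
    private boolean indirectPath(String from, String to) {
        for (String next : successors.getOrDefault(from, Collections.emptySet())) {
            if (!next.equals(to) && reachable(next, to)) {
                return true;
            }
        }
        return false;
    }

    // Merging a and b is unsafe if either reaches the other via a third node.
    boolean mergeWouldCreateCycle(String a, String b) {
        return indirectPath(a, b) || indirectPath(b, a);
    }

    public static void main(String[] args) {
        MergeCycleCheck plan = new MergeCycleCheck();
        plan.addEdge("load", "group");
        plan.addEdge("group", "join");
        plan.addEdge("load", "join");
        System.out.println(plan.mergeWouldCreateCycle("load", "join"));  // prints true
        System.out.println(plan.mergeWouldCreateCycle("load", "group")); // prints false
    }
}
```

In the example, merging "load" with "join" would leave "group" both downstream and upstream of the merged node, which is the kind of cycle described above, while merging "load" with "group" is safe.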

 Sucessive replicated joins do not generate Map Reduce plan and fails due to 
 OOM
 ---

 Key: PIG-1157
 URL: https://issues.apache.org/jira/browse/PIG-1157
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.6.0

 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, 
 replicatedjoinexplain.log





[jira] Updated: (PIG-1141) Make streaming work with the new load-store interfaces

2009-12-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1141:
--

Attachment: PIG-1141.patch

This patch also removes BinaryStorage and StreamOptimizer from the source (in 
the branch).

Here is the result of a locally run 'commit-patch':

{code}
 [exec] -1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 32 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec] 
 [exec] -1 release audit.  The applied patch generated 433 release audit warnings (more than the trunk's current 430 warnings).
{code}

The release audit warnings are all 'html' related.

 Make streaming work with the new load-store interfaces 
 ---

 Key: PIG-1141
 URL: https://issues.apache.org/jira/browse/PIG-1141
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1141.patch, PIG-1141.patch







[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792288#action_12792288
 ] 

Olga Natkovich commented on PIG-1144:
-

+1; patch looks good!

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch





[jira] Commented: (PIG-1161) Add missing apache headers to a few classes

2009-12-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792291#action_12792291
 ] 

Olga Natkovich commented on PIG-1161:
-

I will commit this patch. Thanks, Dmitriy.

 Add missing apache headers to a few classes
 ---

 Key: PIG-1161
 URL: https://issues.apache.org/jira/browse/PIG-1161
 Project: Pig
  Issue Type: Task
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Trivial
 Fix For: 0.7.0

 Attachments: pig_missing_licenses.patch


 The following java classes are missing Apache License headers:
 StoreConfig
 MapRedUtil
 SchemaUtil
 TestDataBagAccess
 TestNullConstant
 TestSchemaUtil
 We should add the missing headers.




[jira] Resolved: (PIG-1161) Add missing apache headers to a few classes

2009-12-17 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1161.
-

Resolution: Fixed

Patch committed.




[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-12-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792295#action_12792295
 ] 

Olga Natkovich commented on PIG-948:


+1. Changes look good.

 [Usability] Relating pig script with MR jobs
 

 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
 Issue Type: Improvement
 Components: impl
 Affects Versions: 0.4.0
 Reporter: Ashutosh Chauhan
 Assignee: Daniel Dai
 Priority: Minor
 Fix For: 0.7.0

 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, 
 PIG-948-5.patch, PIG-948-6.patch, pig-948.patch


 Currently it's hard to find a way to relate a Pig script to a specific MR job. 
 In a loaded cluster with multiple simultaneous job submissions, it's not easy 
 to figure out which specific MR jobs were launched for a given Pig script. If 
 Pig can provide this info, it will be useful for debugging and monitoring the 
 jobs resulting from a Pig script.
 At the very least, Pig should be able to provide the user with the following 
 information:
 1) Job ID of the launched job.
 2) Complete web URL of the jobtracker running this job.




[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs

2009-12-17 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-948:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed.




[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-12-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792340#action_12792340
 ] 

Hadoop QA commented on PIG-948:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428337/PIG-948-6.patch
  against trunk revision 891499.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/138/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/138/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/138/console

This message is automatically generated.
