[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791872#action_12791872 ] Hadoop QA commented on PIG-1143: +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428266/PIG_1143.patch.1 against trunk revision 891499. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/console This message is automatically generated. Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch, PIG_1143.patch.1 The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791909#action_12791909 ]

Tamir Kamara commented on PIG-1150:
---

This can be very useful for me, so I tested your patch but got weird results. I believe the problem is in the combine method: it treats the tuple as if it contains the original values, but to my understanding it should work with the intermediate output and do something like this:

{code}
static protected Tuple combine(DataBag values) throws ExecException {
    double sum = 0;
    long count = 0;
    double sumOfSquares = 0;
    Tuple output = mTupleFactory.newTuple(3);
    // Each input tuple is an intermediate result: (sum, count, sumOfSquares).
    for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
        Tuple t = it.next();
        sum += (Double) t.get(0);
        count += (Long) t.get(1);
        sumOfSquares += (Double) t.get(2);
    }
    output.set(0, sum);
    output.set(1, count);
    output.set(2, sumOfSquares);
    return output;
}
{code}

VAR() Variance UDF
--

Key: PIG-1150
URL: https://issues.apache.org/jira/browse/PIG-1150
Project: Pig
Issue Type: New Feature
Affects Versions: 0.5.0
Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Fix For: 0.7.0
Attachments: var.patch

I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
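The point raised in the comment on PIG-1150 can be illustrated without any Pig classes. The following is only a sketch under stated assumptions, not the code from var.patch: the class name VarianceSketch and its Partial type are hypothetical stand-ins for the (sum, count, sumOfSquares) tuple, and the three methods loosely mirror the initial, intermediate and final stages of an algebraic function. The intermediate step adds the partials field by field and never squares anything again.

```java
import java.util.Arrays;
import java.util.List;

final class VarianceSketch {
    /** Hypothetical partial aggregate passed between stages: (sum, count, sumOfSquares). */
    static final class Partial {
        final double sum;
        final long count;
        final double sumOfSquares;

        Partial(double sum, long count, double sumOfSquares) {
            this.sum = sum;
            this.count = count;
            this.sumOfSquares = sumOfSquares;
        }
    }

    /** Initial stage: wrap one raw value as a partial aggregate. */
    static Partial initial(double x) {
        return new Partial(x, 1L, x * x);
    }

    /** Intermediate stage: add partials field by field; nothing is squared again here. */
    static Partial combine(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        double sumOfSquares = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
            sumOfSquares += p.sumOfSquares;
        }
        return new Partial(sum, count, sumOfSquares);
    }

    /** Final stage: population variance computed as E[x^2] - (E[x])^2. */
    static double variance(Partial p) {
        if (p.count == 0) {
            return Double.NaN;
        }
        double mean = p.sum / p.count;
        return p.sumOfSquares / p.count - mean * mean;
    }

    public static void main(String[] args) {
        // Two hypothetical map tasks each pre-aggregate half of the values 1, 2, 3, 4.
        Partial left = combine(Arrays.asList(initial(1), initial(2)));
        Partial right = combine(Arrays.asList(initial(3), initial(4)));
        Partial all = combine(Arrays.asList(left, right));
        System.out.println(variance(all)); // prints 1.25
    }
}
```

Because combine only adds, it can be applied any number of times and in any grouping, which is what a combiner needs. The sum-of-squares formula can lose precision when the mean is large relative to the spread, so this shows the data flow rather than a numerically robust implementation.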
[jira] Assigned: (PIG-1157) Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM
[ https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1157: - Assignee: Richard Ding Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM --- Key: PIG-1157 URL: https://issues.apache.org/jira/browse/PIG-1157 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.6.0 Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log Hi all, I have a script which does 2 replicated joins in succession. Please note that the inputs do not exist on the HDFS. {code} A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); A1 = FOREACH A GENERATE a; B = GROUP A1 BY a; C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); D = JOIN C BY x, B BY group USING replicated; E = JOIN A BY a, D by x USING replicated; dump E; {code} 2009-12-16 19:12:00,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread Details at logfile: pig_1260990666148.log Looking at the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. 
unable to create new native thread java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) at org.apache.pig.PigServer.store(PigServer.java:522) at org.apache.pig.PigServer.openIterator(PigServer.java:458) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) If we want to look at the explain output, we find that there is no Map Reduce plan that is generated. Why is the M/R plan not generated? Attaching the script and explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1151) Date Conversion + Arithmetic UDFs
[ https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792021#action_12792021 ]

Dmitriy V. Ryaboy commented on PIG-1151:

Yes, these would be quite useful, please contribute! Perhaps put them in a new package, piggybank.date

Date Conversion + Arithmetic UDFs
-

Key: PIG-1151
URL: https://issues.apache.org/jira/browse/PIG-1151
Project: Pig
Issue Type: New Feature
Affects Versions: 0.5.0
Reporter: sam rash
Priority: Minor

I would like to offer up some very simple date UDFs I have that wrap JodaTime (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and operate on ISO8601 date strings (for piggybank). Please advise if these are appropriate.

1. Date arithmetic: takes an input string such as 2009-01-01T13:43:33.000Z (or a partial one such as 2009-01-02) and a timespan (as millis or as string shorthand), and returns an ISO8601 string that adjusts the input date by the specified timespan.
   DatePlus(long timeMs); // + or - number works, is the # of millis
   DatePlus(String timespan); // 10m = 10 minutes, 1h = 1 hour, 1172 ms, etc
   DateMinus(String timespan); // propose explicit minus when using string shorthand for time periods

2. Date comparison (for when you don't have full strings that you can use string compare with):
   DateIsBefore(String dateString); // true if lhs is before rhs
   DateIsAfter(String dateString); // true if lhs is after rhs

3. Date trunc functions: take partial ISO8601 strings and truncate to:
   toMinute(String dateString);
   toHour(String dateString);
   toDay(String dateString);
   toWeek(String dateString);
   toMonth(String dateString);
   toYear(String dateString);

If any/all are helpful, I'm happy to contribute to Pig.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
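To make the proposed date arithmetic concrete, here is a rough sketch of how the timespan shorthand and a few of the functions listed in PIG-1151 might behave. It is not the reporter's code: it uses the JDK's java.time classes instead of JodaTime, the class name DatePlusSketch and the exact shorthand grammar (an integer followed by ms, s, m, h or d) are assumptions, and it only accepts full ISO8601 instants, not the partial dates the proposal mentions.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DatePlusSketch {
    // Assumed shorthand grammar: digits, optional space, then one of ms, s, m, h, d.
    private static final Pattern SPAN = Pattern.compile("\\s*(\\d+)\\s*(ms|s|m|h|d)\\s*");

    /** Parses shorthand such as "10m", "1h" or "1172 ms" into a Duration. */
    static Duration parseTimespan(String span) {
        Matcher m = SPAN.matcher(span);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized timespan: " + span);
        }
        long n = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "ms": return Duration.ofMillis(n);
            case "s":  return Duration.ofSeconds(n);
            case "m":  return Duration.ofMinutes(n);
            case "h":  return Duration.ofHours(n);
            default:   return Duration.ofDays(n);
        }
    }

    /** Shifts a full ISO8601 instant forward by the given shorthand timespan. */
    static String datePlus(String isoDate, String span) {
        return Instant.parse(isoDate).plus(parseTimespan(span)).toString();
    }

    /** Shifts a full ISO8601 instant backward by the given shorthand timespan. */
    static String dateMinus(String isoDate, String span) {
        return Instant.parse(isoDate).minus(parseTimespan(span)).toString();
    }

    /** True if lhs is strictly earlier than rhs. */
    static boolean dateIsBefore(String lhs, String rhs) {
        return Instant.parse(lhs).isBefore(Instant.parse(rhs));
    }

    /** Truncates a full ISO8601 instant to the start of its hour (UTC). */
    static String toHour(String isoDate) {
        return Instant.parse(isoDate).truncatedTo(ChronoUnit.HOURS).toString();
    }

    public static void main(String[] args) {
        String d = "2009-01-01T13:43:33.000Z";
        System.out.println(datePlus(d, "10m"));
        System.out.println(dateMinus(d, "1h"));
        System.out.println(toHour(d));
        System.out.println(dateIsBefore(d, datePlus(d, "1172 ms")));
    }
}
```

Comparing parsed instants rather than raw strings is what makes the comparison functions work when two inputs are written at different precisions, which seems to be the motivation for item 2 in the list.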
[jira] Updated: (PIG-1136) [zebra] Map Split of Storage info do not allow for leading underscore char '_'
[ https://issues.apache.org/jira/browse/PIG-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1136: - Status: Patch Available (was: Open) [zebra] Map Split of Storage info do not allow for leading underscore char '_' -- Key: PIG-1136 URL: https://issues.apache.org/jira/browse/PIG-1136 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Priority: Minor There is some user need to support that type of map keys. Pig's column does not allow for leading underscore, but apparently no restriction is placed on the map key. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1136) [zebra] Map Split of Storage info do not allow for leading underscore char '_'
[ https://issues.apache.org/jira/browse/PIG-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Wang updated PIG-1136: --- Status: Open (was: Patch Available) [zebra] Map Split of Storage info do not allow for leading underscore char '_' -- Key: PIG-1136 URL: https://issues.apache.org/jira/browse/PIG-1136 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Yan Zhou Priority: Minor There is some user need to support that type of map keys. Pig's column does not allow for leading underscore, but apparently no restriction is placed on the map key. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Status: Open (was: Patch Available) set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.6.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism: {code} ... public void visitMROp(MapReduceOper mr) mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism()); ... {code} When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1144: Status: Patch Available (was: Open) set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.6.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism: {code} ... public void visitMROp(MapReduceOper mr) mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism()); ... {code} When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-948: --- Status: Patch Available (was: Open) [Usability] Relating pig script with MR jobs Key: PIG-948 URL: https://issues.apache.org/jira/browse/PIG-948 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Ashutosh Chauhan Assignee: Daniel Dai Priority: Minor Fix For: 0.7.0 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, PIG-948-5.patch, PIG-948-6.patch, pig-948.patch Currently it's hard to relate a Pig script to a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given Pig script. If Pig can provide this info, it will be useful for debugging and monitoring the jobs resulting from a Pig script. At the very least, Pig should be able to provide the user with the following information: 1) Job ID of the launched job. 2) Complete web URL of the jobtracker running this job. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-948: --- Status: Open (was: Patch Available) [Usability] Relating pig script with MR jobs Key: PIG-948 URL: https://issues.apache.org/jira/browse/PIG-948 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Ashutosh Chauhan Assignee: Daniel Dai Priority: Minor Fix For: 0.7.0 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, PIG-948-5.patch, PIG-948-6.patch, pig-948.patch Currently it's hard to relate a Pig script to a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given Pig script. If Pig can provide this info, it will be useful for debugging and monitoring the jobs resulting from a Pig script. At the very least, Pig should be able to provide the user with the following information: 1) Job ID of the launched job. 2) Complete web URL of the jobtracker running this job. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Attachment: (was: PIG_1102.patch) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and for seeing how certain changes in Pig affect it. Other interesting stats to collect would be average CPU usage and max memory usage, but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Patch Available (was: Open) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and for seeing how certain changes in Pig affect it. Other interesting stats to collect would be average CPU usage and max memory usage, but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Open (was: Patch Available) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and for seeing how certain changes in Pig affect it. Other interesting stats to collect would be average CPU usage and max memory usage, but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1157) Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM
[ https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1157: -- Status: Patch Available (was: Open) Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM --- Key: PIG-1157 URL: https://issues.apache.org/jira/browse/PIG-1157 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.6.0 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, replicatedjoinexplain.log Hi all, I have a script which does 2 replicated joins in succession. Please note that the inputs do not exist on the HDFS. {code} A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); A1 = FOREACH A GENERATE a; B = GROUP A1 BY a; C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); D = JOIN C BY x, B BY group USING replicated; E = JOIN A BY a, D by x USING replicated; dump E; {code} 2009-12-16 19:12:00,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread Details at logfile: pig_1260990666148.log Looking at the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. 
unable to create new native thread java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) at org.apache.pig.PigServer.store(PigServer.java:522) at org.apache.pig.PigServer.openIterator(PigServer.java:458) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) If we want to look at the explain output, we find that there is no Map Reduce plan that is generated. Why is the M/R plan not generated? Attaching the script and explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1157) Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM
[ https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1157: -- Attachment: PIG-1157.patch The problem is that, by merging an MR splittee with an FR join, the MultiQuery optimizer may introduce a directed cycle into the graph of the MR plan. This patch fixes the problem by not merging FR splittees. This is actually stricter than necessary. A better solution would be to check, before merging an MR splittee, whether the merge would form a directed cycle in the original DAG and, if not, allow the merge to go ahead. Sucessive replicated joins do not generate Map Reduce plan and fails due to OOM --- Key: PIG-1157 URL: https://issues.apache.org/jira/browse/PIG-1157 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.6.0 Attachments: oomreplicatedjoin.pig, PIG-1157.patch, replicatedjoinexplain.log Hi all, I have a script which does 2 replicated joins in succession. Please note that the inputs do not exist on the HDFS. {code} A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); A1 = FOREACH A GENERATE a; B = GROUP A1 BY a; C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); D = JOIN C BY x, B BY group USING replicated; E = JOIN A BY a, D by x USING replicated; dump E; {code} 2009-12-16 19:12:00,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees.
2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread Details at logfile: pig_1260990666148.log Looking at the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. unable to create new native thread java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) at org.apache.pig.PigServer.store(PigServer.java:522) at org.apache.pig.PigServer.openIterator(PigServer.java:458) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) If we want to look at the explain output, we find that there is no Map Reduce plan that is generated. Why is the M/R plan not generated? Attaching the script and explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
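The "better solution" suggested in the PIG-1157.patch comment, checking whether a merge would create a directed cycle before performing it, can be sketched on a plain adjacency-set graph. This is an illustration only and not the MultiQueryOptimizer code: the class MergeCycleCheck, the string node names, and the choice to contract the two nodes on a copy and then run Kahn's topological sort are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MergeCycleCheck {
    /** Returns a copy of g in which node b is contracted into node a. */
    static Map<String, Set<String>> merge(Map<String, Set<String>> g, String a, String b) {
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : g.entrySet()) {
            String from = e.getKey().equals(b) ? a : e.getKey();
            Set<String> targets = out.computeIfAbsent(from, k -> new HashSet<>());
            for (String t : e.getValue()) {
                String to = t.equals(b) ? a : t;
                if (!from.equals(to)) {
                    targets.add(to); // the direct a-b edge is absorbed by the merge
                }
                out.computeIfAbsent(to, k -> new HashSet<>());
            }
        }
        return out;
    }

    /** Kahn's algorithm: the graph has a cycle if not every node can be removed. */
    static boolean hasCycle(Map<String, Set<String>> g) {
        Map<String, Integer> indegree = new HashMap<>();
        for (String n : g.keySet()) {
            indegree.putIfAbsent(n, 0);
        }
        for (Set<String> targets : g.values()) {
            for (String t : targets) {
                indegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : indegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        int removed = 0;
        while (!ready.isEmpty()) {
            String n = ready.poll();
            removed++;
            for (String t : g.getOrDefault(n, Collections.<String>emptySet())) {
                if (indegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        return removed != indegree.size();
    }

    /** A merge is allowed only if the contracted graph is still acyclic. */
    static boolean canMerge(Map<String, Set<String>> dag, String a, String b) {
        return !hasCycle(merge(dag, a, b));
    }

    public static void main(String[] args) {
        // Hypothetical plan: splitter S feeds A and B, and B also consumes A's output.
        Map<String, Set<String>> dag = new HashMap<>();
        dag.put("S", new HashSet<>(Arrays.asList("A", "B")));
        dag.put("A", new HashSet<>(Arrays.asList("B")));
        dag.put("B", new HashSet<>());
        System.out.println(canMerge(dag, "S", "A")); // prints true
        System.out.println(canMerge(dag, "S", "B")); // prints false
    }
}
```

In the toy plan, merging B into S fails because A sits on a path between them: after the merge the combined node would both feed A and depend on it. Contracting on a copy keeps the original DAG untouched, so a rejected merge needs no rollback.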
[jira] Updated: (PIG-1141) Make streaming work with the new load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1141: -- Attachment: PIG-1141.patch This patch also removes BinaryStorage and StreamOptimizer from the source (in the branch). Here is the result of locally run 'commit-patch': {code} [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 32 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 433 release audit warnings (more than the trunk's current 430 warnings). {code} The release audit warnings are all 'html' related. Make streaming work with the new load-store interfaces --- Key: PIG-1141 URL: https://issues.apache.org/jira/browse/PIG-1141 Project: Pig Issue Type: Sub-task Reporter: Richard Ding Assignee: Richard Ding Attachments: PIG-1141.patch, PIG-1141.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792288#action_12792288 ] Olga Natkovich commented on PIG-1144: - +1; patch looks good! set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.6.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch, PIG-1144-4.patch Hi all, I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100 . I modified MRPrinter.java to print out the parallelism: {code} ... public void visitMROp(MapReduceOper mr) mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism()); ... {code} When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword in front of the ORDER BY. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1161) Add missing apache headers to a few classes
[ https://issues.apache.org/jira/browse/PIG-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792291#action_12792291 ] Olga Natkovich commented on PIG-1161: - I will commit this patch. Thanks, Dmitriy. Add missing apache headers to a few classes --- Key: PIG-1161 URL: https://issues.apache.org/jira/browse/PIG-1161 Project: Pig Issue Type: Task Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Trivial Fix For: 0.7.0 Attachments: pig_missing_licenses.patch The following java classes are missing Apache License headers: StoreConfig MapRedUtil SchemaUtil TestDataBagAccess TestNullConstant TestSchemaUtil We should add the missing headers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1161) Add missing apache headers to a few classes
[ https://issues.apache.org/jira/browse/PIG-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich resolved PIG-1161. - Resolution: Fixed patch committed Add missing apache headers to a few classes --- Key: PIG-1161 URL: https://issues.apache.org/jira/browse/PIG-1161 Project: Pig Issue Type: Task Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Trivial Fix For: 0.7.0 Attachments: pig_missing_licenses.patch The following java classes are missing Apache License headers: StoreConfig MapRedUtil SchemaUtil TestDataBagAccess TestNullConstant TestSchemaUtil We should add the missing headers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792295#action_12792295 ] Olga Natkovich commented on PIG-948: +1. Changes look good. [Usability] Relating pig script with MR jobs Key: PIG-948 URL: https://issues.apache.org/jira/browse/PIG-948 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Ashutosh Chauhan Assignee: Daniel Dai Priority: Minor Fix For: 0.7.0 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, PIG-948-5.patch, PIG-948-6.patch, pig-948.patch Currently it's hard to relate a Pig script to a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given Pig script. If Pig can provide this info, it will be useful for debugging and monitoring the jobs resulting from a Pig script. At the very least, Pig should be able to provide the user with the following information: 1) Job ID of the launched job. 2) Complete web URL of the jobtracker running this job. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-948:

Resolution: Fixed
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

Patch committed.

[Usability] Relating pig script with MR jobs
Key: PIG-948
URL: https://issues.apache.org/jira/browse/PIG-948
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Daniel Dai
Priority: Minor
Fix For: 0.7.0
Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, PIG-948-5.patch, PIG-948-6.patch, pig-948.patch

Currently it's hard to relate a Pig script to a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given Pig script. If Pig can provide this info, it will be useful for debugging and monitoring the jobs resulting from a Pig script.
At the very least, Pig should be able to provide the user with the following information:
1) Job ID of the launched job.
2) Complete web URL of the jobtracker running this job.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs
[ https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792340#action_12792340 ]

Hadoop QA commented on PIG-948:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12428337/PIG-948-6.patch
against trunk revision 891499.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/138/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/138/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/138/console

This message is automatically generated.

[Usability] Relating pig script with MR jobs
Key: PIG-948
URL: https://issues.apache.org/jira/browse/PIG-948
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Daniel Dai
Priority: Minor
Fix For: 0.7.0
Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, PIG-948-5.patch, PIG-948-6.patch, pig-948.patch

Currently it's hard to relate a Pig script to a specific MR job. In a loaded cluster with multiple simultaneous job submissions, it's not easy to figure out which specific MR jobs were launched for a given Pig script. If Pig can provide this info, it will be useful for debugging and monitoring the jobs resulting from a Pig script.
At the very least, Pig should be able to provide the user with the following information:
1) Job ID of the launched job.
2) Complete web URL of the jobtracker running this job.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.