[jira] Created: (PIG-1662) Need better error message for MalFormedProbVecException
Need better error message for MalFormedProbVecException
---
Key: PIG-1662
URL: https://issues.apache.org/jira/browse/PIG-1662
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0

Instead of the generic error message:

Backend error message
-
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.MalFormedProbVecException: ERROR 2122: Sum of probabilities should be one
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.DiscreteProbabilitySampleGenerator.init(DiscreteProbabilitySampleGenerator.java:56)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:128)
    ... 10 more

the exception could easily print out the contents of the malformed probability vector.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
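A minimal sketch of the requested behavior, assuming a plain sum check with a tolerance. This is not the attached patch: the class name `ProbVecCheck`, the `validate` method, the `EPSILON` value, and the use of `IllegalArgumentException` are all hypothetical. It only shows how the failing vector could be carried in the message.

```java
import java.util.Arrays;

// Hypothetical stand-in for the check that raises ERROR 2122.
// Names and the tolerance are assumptions for illustration only.
final class ProbVecCheck {
    static final double EPSILON = 1e-5; // assumed tolerance

    // Returns the sum, or throws with the vector contents in the message.
    static double validate(float[] probVec) {
        double sum = 0.0;
        for (float p : probVec) {
            sum += p;
        }
        if (Math.abs(sum - 1.0) > EPSILON) {
            throw new IllegalArgumentException(
                "ERROR 2122: Sum of probabilities should be one: "
                + Arrays.toString(probVec) + " (sum = " + sum + ")");
        }
        return sum;
    }
}
```

With the vector in the message, a reader of the backend log can see which entries are off without rerunning the job.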
[jira] Updated: (PIG-1662) Need better error message for MalFormedProbVecException
[ https://issues.apache.org/jira/browse/PIG-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1662: -- Attachment: PIG-1662.patch
[jira] Updated: (PIG-1662) Need better error message for MalFormedProbVecException
[ https://issues.apache.org/jira/browse/PIG-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1662: -- Status: Patch Available (was: Open)
[jira] Commented: (PIG-1656) TOBAG udfs ignores columns with null value; it does not use input type to determine output schema
[ https://issues.apache.org/jira/browse/PIG-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917050#action_12917050 ]

Richard Ding commented on PIG-1656:
---
We need to make it clear how the output schema of TOBAG is generated. For example, in the first case, the type is preserved in the inner schema:

{code}
grunt> a = load 'input' as (a0:int, a1:int);
grunt> b = foreach a generate TOBAG(a0, a1);
grunt> describe b;
b: {{int}}
{code}

but not in the second case:

{code}
grunt> a = load 'input' as (a0:int, a1:int);
grunt> c = group a by a0 ;
grunt> b = foreach c generate TOBAG(a.a0, a.a1);
grunt> describe b;
b: {{NULL}}
{code}

TOBAG udfs ignores columns with null value; it does not use input type to determine output schema
---
Key: PIG-1656
URL: https://issues.apache.org/jira/browse/PIG-1656
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: PIG-1656.1.patch

TOBAG udf ignores columns with null value

{code}
R4= foreach B generate $0, TOBAG( id, null, id,null );
grunt> dump R4;
1000{(1),(1)}
1000{(2),(2)}
1000{(3),(3)}
1000{(4),(4)}
{code}

TOBAG does not use input type to determine output schema

{code}
grunt> B1 = foreach B generate TOBAG( 1, 2, 3);
grunt> describe B1;
B1: {{null}}
{code}
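The comment asks for a clear rule but does not state one. As an illustration only, one possible rule is sketched here: use the declared input type as the bag's inner type when every argument agrees, and fall back to an unknown type otherwise. The `FieldType` enum and `innerType` method are hypothetical and are not Pig's `Schema` or `DataType` API.

```java
import java.util.List;

// Hypothetical model of schema inference for a TOBAG-like function.
// This is an assumed rule, not the behavior confirmed by the issue.
final class BagSchemaSketch {
    enum FieldType { INT, LONG, CHARARRAY, NULL }

    // If every input column has the same declared type, use it as the
    // inner type of the bag; otherwise fall back to NULL (unknown).
    static FieldType innerType(List<FieldType> inputTypes) {
        if (inputTypes.isEmpty()) {
            return FieldType.NULL;
        }
        FieldType first = inputTypes.get(0);
        for (FieldType t : inputTypes) {
            if (t != first) {
                return FieldType.NULL;
            }
        }
        return first;
    }
}
```

Under this assumed rule, `TOBAG(a0, a1)` with two int columns would describe as a bag of int, which matches the first case in the comment.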
[jira] Commented: (PIG-1656) TOBAG udfs ignores columns with null value; it does not use input type to determine output schema
[ https://issues.apache.org/jira/browse/PIG-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917108#action_12917108 ] Richard Ding commented on PIG-1656: --- +1 TOBAG udfs ignores columns with null value; it does not use input type to determine output schema --- Key: PIG-1656 URL: https://issues.apache.org/jira/browse/PIG-1656 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1656.1.patch, PIG-1656.2.patch
[jira] Updated: (PIG-1651) PIG class loading mishandled
[ https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1651:
--
Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Resolution: Fixed

PIG class loading mishandled

Key: PIG-1651
URL: https://issues.apache.org/jira/browse/PIG-1651
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1651.patch

If zebra.jar is only registered in a Pig script and is not on the CLASSPATH, a query that uses Zebra fails. The class appears to be loaded into the JVM more than once, so a static variable set earlier is not visible after an instance of the class is created through reflection. (Once zebra.jar is added to the CLASSPATH, the query works fine.)

The exception stack is as follows:

Backend error message during job submission
---
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://hostname/pathto/zebra_dir :: null
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
    at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
    at org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
    at org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
    ... 7 more
[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework
[ https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915889#action_12915889 ]

Richard Ding commented on PIG-1648:
---
+1

Split combination may return too many block locations to map/reduce framework
-
Key: PIG-1648
URL: https://issues.apache.org/jira/browse/PIG-1648
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Fix For: 0.8.0
Attachments: PIG-1648.patch

For instance, suppose one small split has block locations h1, h2 and h3, and another small split has h1, h3 and h4. After combination, the composite split contains 4 block locations. If the number of component splits is big, the number of block locations can be big too. In fact, the block locations only serve as a hint to M/R about the best hosts to run this composite split on, so the list should be short, say 5 hosts, naming those that hold the most data in this composite split.
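The selection described above can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: the class `TopHostsSketch`, its `topHosts` method, and the input shape (a list of host lists plus an array of split lengths) are hypothetical, and each host is credited with the full length of every component split it appears in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: pick the hosts that hold the most data
// across the component splits of a composite split.
final class TopHostsSketch {
    // hostsPerSplit.get(i) lists the block locations of component split i,
    // lengths[i] is its size in bytes. Returns at most maxHosts host names,
    // ordered by the number of bytes credited to each host (largest first).
    static List<String> topHosts(List<List<String>> hostsPerSplit, long[] lengths, int maxHosts) {
        Map<String, Long> bytesByHost = new HashMap<>();
        for (int i = 0; i < hostsPerSplit.size(); i++) {
            for (String host : hostsPerSplit.get(i)) {
                bytesByHost.merge(host, lengths[i], Long::sum);
            }
        }
        List<String> hosts = new ArrayList<>(bytesByHost.keySet());
        // Sort by bytes descending; break ties by name for a stable result.
        hosts.sort((a, b) -> {
            int byBytes = Long.compare(bytesByHost.get(b), bytesByHost.get(a));
            return byBytes != 0 ? byBytes : a.compareTo(b);
        });
        return new ArrayList<>(hosts.subList(0, Math.min(maxHosts, hosts.size())));
    }
}
```

For the example in the description, with splits on (h1, h2, h3) and (h1, h3, h4), h1 and h3 appear in both splits and so rank ahead of h2 and h4.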
[jira] Commented: (PIG-1651) PIG class loading mishandled
[ https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915945#action_12915945 ] Richard Ding commented on PIG-1651: --- The problem here is that PigContext uses LogicalPlanBuilder.classloader to instantiate the LoadFuncs, but the context ClassLoader for the Thread uses a different class loader, and hence the static variable set for the class loaded by one loader is not visible by the class loaded by the other loader. The solution is to use the same LogicalPlanBuilder.classloader as the context ClassLoader for the Thread.
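The idea in the comment, installing one loader as the thread's context class loader, can be sketched with the standard `Thread.setContextClassLoader` call. The helper `ContextLoaderSketch.callWith` is hypothetical, and restoring the previous loader afterwards is an assumption of this sketch; the comment does not say whether the actual patch does so.

```java
import java.util.function.Supplier;

// Hypothetical helper, not code from the PIG-1651 patch.
final class ContextLoaderSketch {
    // Runs the task with the given loader installed as the current thread's
    // context class loader, then restores the previous loader.
    static <T> T callWith(ClassLoader loader, Supplier<T> task) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return task.get();
        } finally {
            current.setContextClassLoader(previous);
        }
    }
}
```

Code that resolves classes through the context class loader inside `task` then sees the same loader that instantiated the load function, so both sides share one copy of each class and its static fields.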
[jira] Updated: (PIG-1651) PIG class loading mishandled
[ https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1651: -- Status: Patch Available (was: Open)
[jira] Updated: (PIG-1651) PIG class loading mishandled
[ https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1651: -- Attachment: PIG-1651.patch
[jira] Commented: (PIG-1649) FRJoin fails to compute number of input files for replicated input
[ https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915985#action_12915985 ]

Richard Ding commented on PIG-1649:
---
+1. Looks good.

FRJoin fails to compute number of input files for replicated input
--
Key: PIG-1649
URL: https://issues.apache.org/jira/browse/PIG-1649
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: PIG-1649.1.patch, PIG-1649.2.patch, PIG-1649.3.patch, PIG-1649.4.patch

In FRJoin, if the input path contains curly braces, Pig fails to compute the number of input files and logs the following exception:

10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of input files
java.net.URISyntaxException: Illegal character in path at index 12: /user/tejas/{std*txt}
    at java.net.URI$Parser.fail(URI.java:2809)
    at java.net.URI$Parser.checkChars(URI.java:2982)
    at java.net.URI$Parser.parseHierarchical(URI.java:3066)
    at java.net.URI$Parser.parse(URI.java:3024)
    at java.net.URI.<init>(URI.java:578)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
    at org.apache.pig.PigServer.storeEx(PigServer.java:873)
    at org.apache.pig.PigServer.store(PigServer.java:815)
    at org.apache.pig.PigServer.openIterator(PigServer.java:727)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
    at org.apache.pig.Main.run(Main.java:453)
    at org.apache.pig.Main.main(Main.java:107)

This does not cause the query to fail. But because the number of input files is not calculated, the optimizations added in PIG-1458 to reduce load on the name node are not used.
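The failure in the log can be reproduced with `java.net.URI` alone, since a curly brace is not a legal URI path character. The helper below is a hypothetical illustration of that behavior; it is not the fix from the attached patches, which the issue text does not describe.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper that shows why a glob such as {std*txt}
// cannot be handed to java.net.URI directly.
final class GlobPathSketch {
    // Returns -1 if the string parses as a URI, otherwise the index of the
    // offending character reported by java.net.URI.
    static int firstIllegalIndex(String path) {
        try {
            new URI(path);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }
}
```

For `/user/tejas/{std*txt}` the reported index is 12, the position of the opening brace, matching the message in the log above. Any code that counts input files would therefore need to treat the string as a glob pattern rather than parse it as a URI.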
[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1642:
--
Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

Order by doesn't use estimation to determine the parallelism

Key: PIG-1642
URL: https://issues.apache.org/jira/browse/PIG-1642
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1642.patch, PIG-1642_1.patch, PIG-1642_1.patch

With PIG-1249, a simple heuristic is used to determine the number of reducers if it isn't specified (via PARALLEL or default_parallel). For an ORDER BY statement, however, the number of reducers still defaults to 1.
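The issue does not spell out the heuristic from PIG-1249. As a rough sketch only, a size-based estimate could look like the following; the class name, the `estimate` method, and both constants are assumptions for illustration and are not taken from the issue or the patch.

```java
// Hypothetical size-based reducer estimate; constants are assumed values.
final class ReducerEstimateSketch {
    static final long BYTES_PER_REDUCER = 1_000_000_000L; // assumed input bytes per reducer
    static final int MAX_REDUCERS = 999;                  // assumed upper bound

    // One reducer per BYTES_PER_REDUCER of input, rounded up,
    // clamped to the range [1, MAX_REDUCERS].
    static int estimate(long totalInputBytes) {
        long needed = (totalInputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.max(1, Math.min(MAX_REDUCERS, needed));
    }
}
```

The point of the issue is that an estimate of this kind should also apply to ORDER BY jobs when neither PARALLEL nor default_parallel is set, instead of falling back to a single reducer.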
[jira] Updated: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1641:
--
Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

Incorrect counters in local mode

Key: PIG-1641
URL: https://issues.apache.org/jira/browse/PIG-1641
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1641.patch

User report, not verified.

<email>
HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY

Success!

Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,

Input(s):
Successfully read 0 records from: Data/Raw/UserVisits.dat

Output(s):
Successfully stored 0 records in: Processed/user_visits_table

However, when I look in the output:

$ ls -lh Processed/user_visits_table/CG0/
total 15250760
-rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*

It read a 20G input file and generated some output...
</email>

Is it that in local mode counters are not available? If so, instead of printing zeros we should print "Information Unavailable" or some such.
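The suggestion at the end of the report can be sketched as a formatting choice: represent a missing counter explicitly instead of as zero. The class `CounterDisplaySketch`, its method, and the wording of the unavailable case are hypothetical; the attached patch may report this differently.

```java
import java.util.OptionalLong;

// Hypothetical formatter: an empty OptionalLong means the counter
// could not be obtained (for example, in local mode).
final class CounterDisplaySketch {
    static String recordsRead(OptionalLong count, String source) {
        if (!count.isPresent()) {
            return "Records read from " + source + ": Information Unavailable";
        }
        return "Successfully read " + count.getAsLong() + " records from: " + source;
    }
}
```

Keeping "unknown" distinct from zero avoids the misleading "Successfully read 0 records" line shown in the report.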
[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1642: -- Attachment: PIG-1642.patch The patch passed test-core. The results of test-patch: {code} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 8 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {code}
[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1642: -- Status: Patch Available (was: Open)
[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1642: -- Attachment: PIG-1642_1.patch
[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1642: -- Attachment: PIG-1642_1.patch
[jira] Commented: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914667#action_12914667 ] Richard Ding commented on PIG-1642: --- New patch to address the review comments.
[jira] Assigned: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1642: - Assignee: Richard Ding
[jira] Updated: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1641: -- Fix Version/s: 0.8.0
[jira] Updated: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1641: -- Status: Patch Available (was: Open) Incorrect counters in local mode Key: PIG-1641 URL: https://issues.apache.org/jira/browse/PIG-1641 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1641.patch
User report, not verified.
<email>
HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
Success!
Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
Input(s):
Successfully read 0 records from: Data/Raw/UserVisits.dat
Output(s):
Successfully stored 0 records in: Processed/user_visits_table
However, when I look in the output:
$ ls -lh Processed/user_visits_table/CG0/
total 15250760
-rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*
It read a 20G input file and generated some output...
</email>
Is it that in local mode counters are not available? If so, instead of printing zeros we should print "Information Unavailable" or some such.
[jira] Updated: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1641: -- Attachment: PIG-1641.patch Incorrect counters in local mode Key: PIG-1641 URL: https://issues.apache.org/jira/browse/PIG-1641 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1641.patch
User report, not verified.
<email>
HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
Success!
Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
Input(s):
Successfully read 0 records from: Data/Raw/UserVisits.dat
Output(s):
Successfully stored 0 records in: Processed/user_visits_table
However, when I look in the output:
$ ls -lh Processed/user_visits_table/CG0/
total 15250760
-rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*
It read a 20G input file and generated some output...
</email>
Is it that in local mode counters are not available? If so, instead of printing zeros we should print "Information Unavailable" or some such.
[jira] Commented: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913736#action_12913736 ] Richard Ding commented on PIG-1641: --- Hadoop counters are not available in local mode (PIG-1286). So for now I propose that, in local mode, Pig stats output is changed to something like the following:
{code}
Job Stats (time in seconds):
JobId           Alias      Feature   Outputs
job_local_0001  raw        MAP_ONLY
job_local_0002  rank_sort  SAMPLER
job_local_0003  rank_sort  ORDER_BY  Processed/user_visits_table,
Input(s):
Successfully read records from: Data/Raw/UserVisits.dat
Output(s):
Successfully stored records in: Processed/user_visits_table
{code}
Incorrect counters in local mode Key: PIG-1641 URL: https://issues.apache.org/jira/browse/PIG-1641 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan
User report, not verified.
<email>
HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
Success!
Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
Input(s):
Successfully read 0 records from: Data/Raw/UserVisits.dat
Output(s):
Successfully stored 0 records in: Processed/user_visits_table
However, when I look in the output:
$ ls -lh Processed/user_visits_table/CG0/
total 15250760
-rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*
It read a 20G input file and generated some output...
</email>
Is it that in local mode counters are not available? If so, instead of printing zeros we should print "Information Unavailable" or some such.
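The proposal in this comment amounts to leaving the record count out of the summary lines when no counter exists, rather than printing a misleading zero. A minimal Python sketch of that formatting rule follows; it is an illustration only, not the code in PIG-1641.patch, and the function names and the use of `None` for "counter unavailable" are assumptions made for the example.

```python
from typing import Optional


def read_line(path: str, records: Optional[int]) -> str:
    """Format an Input(s) line; records is None when no counter is available."""
    if records is None:
        # Local mode: no Hadoop counters, so omit the count instead of printing 0.
        return f"Successfully read records from: {path}"
    return f"Successfully read {records} records from: {path}"


def stored_line(path: str, records: Optional[int]) -> str:
    """Format an Output(s) line with the same rule as read_line."""
    if records is None:
        return f"Successfully stored records in: {path}"
    return f"Successfully stored {records} records in: {path}"
```

Keeping "unavailable" distinct from a real count of zero means a genuinely empty input can still be reported as 0 records.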
[jira] Assigned: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1641: - Assignee: Richard Ding Incorrect counters in local mode Key: PIG-1641 URL: https://issues.apache.org/jira/browse/PIG-1641 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Ashutosh Chauhan Assignee: Richard Ding
User report, not verified.
<email>
HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
Success!
Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
Input(s):
Successfully read 0 records from: Data/Raw/UserVisits.dat
Output(s):
Successfully stored 0 records in: Processed/user_visits_table
However, when I look in the output:
$ ls -lh Processed/user_visits_table/CG0/
total 15250760
-rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*
It read a 20G input file and generated some output...
</email>
Is it that in local mode counters are not available? If so, instead of printing zeros we should print "Information Unavailable" or some such.
[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1642: -- Summary: Order by doesn't use estimation to determine the parallelism (was: Order by doesn't use estimation to determine the paralelism) Order by doesn't use estimation to determine the parallelism Key: PIG-1642 URL: https://issues.apache.org/jira/browse/PIG-1642 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Fix For: 0.8.0 With PIG-1249, a simple heuristic is used to determine the number of reducers if it isn't specified (via PARALLEL or default_parallel). For order by statement, however, it still defaults to 1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1642) Order by doesn't use estimation to determine the paralelism
Order by doesn't use estimation to determine the paralelism --- Key: PIG-1642 URL: https://issues.apache.org/jira/browse/PIG-1642 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Fix For: 0.8.0 With PIG-1249, a simple heuristic is used to determine the number of reducers if it isn't specified (via PARALLEL or default_parallel). For order by statement, however, it still defaults to 1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
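For readers unfamiliar with the PIG-1249 heuristic mentioned here, the sketch below shows one plausible shape of it in Python: divide the input size by a target number of bytes per reducer, then clamp the result. The issue text only says "a simple heuristic", so the formula, the constants (1 GB per reducer, a cap of 999) and the function names are assumptions for illustration, not a statement of what Pig does.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Estimate a reducer count from input size (constants are assumed values)."""
    reducers = math.ceil(input_bytes / bytes_per_reducer)
    # Always use at least one reducer and never exceed the cap.
    return max(1, min(max_reducers, reducers))


def reducers_for(requested_parallel, input_bytes):
    """An explicit PARALLEL / default_parallel value wins over the estimate."""
    if requested_parallel is not None and requested_parallel > 0:
        return requested_parallel
    return estimate_reducers(input_bytes)
```

Under these assumptions a 20 GB input with no explicit parallelism would get 20 reducers; the bug reported here is that order by does not take this path and stays at 1.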
[jira] Commented: (PIG-1616) 'union onschema' does not use create output with correct schema when udfs are involved
[ https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912696#action_12912696 ] Richard Ding commented on PIG-1616: --- +1 'union onschema' does not use create output with correct schema when udfs are involved -- Key: PIG-1616 URL: https://issues.apache.org/jira/browse/PIG-1616 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1616.1.patch 'union onschema' creates a merged schema based on the input schemas. It does that in the queryparser, and at that stage the UDF return type used is the default return type. The actual return type for the UDF is determined later in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping(). 'union onschema' should use the final type for its input relation to create the merged schema.
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: pig-greek-test.tar Attached the test script, modified based on Julien's comment. As for the command line option -g, it can also take one parameter (the script file name) and let Pig determine the script engine from the file extension. Embed Pig in scripting languages Key: PIG-1479 URL: https://issues.apache.org/jira/browse/PIG-1479 Project: Pig Issue Type: New Feature Reporter: Julien Le Dem Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, pig-greek-test.tar, pig-greek.tgz It should be possible to embed Pig calls in a scripting language and make functions defined in the same script available as UDFs. This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which lets users define UDFs in scripting languages.
[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag
[ https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910407#action_12910407 ] Richard Ding commented on PIG-1615: --- This problem exists in Pig 0.7 and is fixed in Pig 0.8. Return code from Pig is 0 even if the job fails when using -M flag -- Key: PIG-1615 URL: https://issues.apache.org/jira/browse/PIG-1615 Project: Pig Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Viraj Bhat Fix For: 0.8.0
I have a Pig script of this form, which I used inside a workflow system such as Oozie.
{code}
A = load '$INPUT' using PigStorage();
store A into '$OUTPUT';
{code}
I run this with Multi-query optimization turned off:
{quote}
$ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
{quote}
The directory /user/viraj/junk1 is not present, so I get the following results:
{quote}
Input(s):
Failed to read data from /user/viraj/junk1
Output(s):
Failed to produce result in /user/viraj/junk2
{quote}
This is expected, but the return code is still 0:
{code}
$ echo $?
0
{code}
If I run this script with Multi-query optimization turned on, it gives a return code of 2, which is correct.
{code}
$ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
...
$ echo $?
2
{code}
I believe a wrong return code from Pig is causing Oozie to believe that the Pig script succeeded.
Viraj
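The reason the return code matters is that a workflow system typically decides success or failure from the child process's exit status alone. The Python sketch below illustrates that decision in general terms; it does not show how Oozie is implemented, and `run_step` and `failing_step` are hypothetical names. A small Python child process that exits with status 2 stands in for a failing Pig run.

```python
import subprocess
import sys


def run_step(cmd):
    """Run one workflow step; treat it as successful only if it exits with 0."""
    completed = subprocess.run(cmd)
    return completed.returncode == 0, completed.returncode


# Stand-in for a failing Pig invocation: a child process that exits with status 2.
failing_step = [sys.executable, "-c", "import sys; sys.exit(2)"]
```

With this rule, a failed job that still exits with 0 (the -M case reported above) is indistinguishable from a successful one, which is why the caller ends up treating the script as succeeded.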
[jira] Commented: (PIG-1610) 'union onschema' does handle some cases involving 'namespaced' column names in schema
[ https://issues.apache.org/jira/browse/PIG-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910409#action_12910409 ] Richard Ding commented on PIG-1610: --- +1 'union onschema' does handle some cases involving 'namespaced' column names in schema - Key: PIG-1610 URL: https://issues.apache.org/jira/browse/PIG-1610 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1610.1.patch, PIG-1610.2.patch
case 1:
grunt> describe f;
f: {l1::a: bytearray,l1::b: bytearray}
grunt> describe l1;
l1: {a: bytearray,b: bytearray}
grunt> dump f;
(1,11)
(2,22)
(3,33)
grunt> dump l1;
(1,11)
(2,22)
(3,33)
grunt> u = union onschema f, l1;
grunt> describe u;
u: {l1::a: bytearray,l1::b: bytearray}
-- the dump u gives incorrect results
grunt> dump u;
(,)
(,)
(,)
(1,11)
(2,22)
(3,33)
case 2:
grunt> u = union onschema l1, f;
grunt> describe u;
2010-09-13 15:11:13,877 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1108: Duplicate schema alias: l1::a
Details at logfile: /Users/tejas/pig_unions_err2/trunk/pig_1284410413970.log
[jira] Commented: (PIG-1609) 'union onschema' should give a more useful error message when schema of one of the relations has null column name
[ https://issues.apache.org/jira/browse/PIG-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909412#action_12909412 ] Richard Ding commented on PIG-1609: --- +1 'union onschema' should give a more useful error message when schema of one of the relations has null column name - Key: PIG-1609 URL: https://issues.apache.org/jira/browse/PIG-1609 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1609.1.patch
A better error message needs to be given in this case -
{code}
grunt> l = load '/tmp/empty.bag' as (i : int);
grunt> f = foreach l generate i+1;
grunt> describe f;
f: {int}
grunt> u = union onschema l , f;
2010-09-10 18:08:13,000 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Error merging schemas for union operator
Details at logfile: /Users/tejas/pig_nmr_syn/trunk/pig_1284167020897.log
{code}
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: PIG-1479_2.patch In the previous patch, the executeScript method on ScriptPigServer returns a list of ExecJobs (one for each store statement in the script). Unfortunately, the order of ExecJobs in the list is indeterminate. This patch fixes this problem by making the executeScript method return a PigStats object. One can then retrieve the output result by the alias corresponding to the store statement. Here is an example:
{code}
P = pig.executeScript("""
    A = load '${input}';
    ...
    ...
    store G into '${output}';
""")
output = P.result("G")  # an OutputStats object
iter = output.iterator()
if iter.hasNext():
    # do something
    pass
else:
    # do something else
    pass
{code}
Embed Pig in scripting languages Key: PIG-1479 URL: https://issues.apache.org/jira/browse/PIG-1479 Project: Pig Issue Type: New Feature Reporter: Julien Le Dem Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek.tgz It should be possible to embed Pig calls in a scripting language and make functions defined in the same script available as UDFs. This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which lets users define UDFs in scripting languages.
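The design point in this update is that looking a result up by alias does not depend on the order in which jobs happen to be returned. The self-contained Python sketch below illustrates that with stand-in classes; `OutputStats` and `PigStats` here are simplified, hypothetical stand-ins that borrow the names from the comment, not the real Pig classes, and their constructors and fields are assumptions.

```python
class OutputStats:
    """Hypothetical stand-in holding the result of one store statement."""

    def __init__(self, alias, location, records):
        self.alias = alias
        self.location = location
        self.records = list(records)

    def iterator(self):
        return iter(self.records)


class PigStats:
    """Hypothetical stand-in that indexes outputs by the stored alias."""

    def __init__(self, outputs):
        # Keying by alias makes lookups independent of job completion order.
        self._by_alias = {out.alias: out for out in outputs}

    def result(self, alias):
        return self._by_alias[alias]
```

Whatever order the outputs arrive in, `stats.result("G")` returns the output of the `store G` statement, which is what a caller of a positional list of jobs could not rely on.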
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: pig-greek-test.tar Attached the updated test program from Julien. To run the example:
* tar -xvf pig-greek-test.tar
* java -cp pig.jar:<jython jar> org.apache.pig.Main -x local -g script/tc.py
Embed Pig in scripting languages Key: PIG-1479 URL: https://issues.apache.org/jira/browse/PIG-1479 Project: Pig Issue Type: New Feature Reporter: Julien Le Dem Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, pig-greek.tgz It should be possible to embed Pig calls in a scripting language and make functions defined in the same script available as UDFs. This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which lets users define UDFs in scripting languages.
[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven
[ https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1562: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. Thanks Niraj! Fix the version for the dependent packages for the maven - Key: PIG-1562 URL: https://issues.apache.org/jira/browse/PIG-1562 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai Fix For: 0.8.0 Attachments: PIG-1562_1.patch, PIG-1562_2.patch, PIG_1562_0.patch We need to fix the set version so that the version is properly set for the dependent packages in the maven repository.
[jira] Resolved: (PIG-630) provide indication that pig script only partially succeeded
[ https://issues.apache.org/jira/browse/PIG-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-630. -- Assignee: Olga Natkovich Fix Version/s: 0.8.0 Resolution: Fixed This jira has been fixed with MultiQuery optimization and Pig Stats. provide indication that pig script only partially succeeded --- Key: PIG-630 URL: https://issues.apache.org/jira/browse/PIG-630 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.8.0 Currently, if you have multiple queries (stores/dumps) within the same pig script, the script returns the result of the last one, which does not provide sufficient information to the users. We need to provide the user with the following information:
- a return code that indicates the script only partially succeeded
- an indication of which parts have succeeded
[jira] Commented: (PIG-1589) add test cases for mapreduce operator which use distributed cache
[ https://issues.apache.org/jira/browse/PIG-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909061#action_12909061 ] Richard Ding commented on PIG-1589: --- +1 add test cases for mapreduce operator which use distributed cache - Key: PIG-1589 URL: https://issues.apache.org/jira/browse/PIG-1589 Project: Pig Issue Type: Task Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1589.1.patch, TestWordCount.jar '-files filename' can be specified in the parameters for mapreduce operator to send files to distributed cache. Need to add test cases for that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: PIG-1479.patch Thanks Julien. I rebased the patch with the latest trunk and added an option (-greek) in the Main class. Now one can run a PIG-Greek script with the following command:
{code}
java -cp pig.jar:<jython jar>:<hadoop config dir> org.apache.pig.Main -g <pig-greek script>
{code}
or in local mode:
{code}
java -cp pig.jar:<jython jar> org.apache.pig.Main -x local -g <pig-greek script>
{code}
Embed Pig in scripting languages Key: PIG-1479 URL: https://issues.apache.org/jira/browse/PIG-1479 Project: Pig Issue Type: New Feature Reporter: Julien Le Dem Attachments: PIG-1479.patch, pig-greek.tgz It should be possible to embed Pig calls in a scripting language and make functions defined in the same script available as UDFs. This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which lets users define UDFs in scripting languages.
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Attachment: PIG-1548.patch Optimize scalar to consolidate the part file Key: PIG-1548 URL: https://issues.apache.org/jira/browse/PIG-1548 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1548.patch The current scalar implementation writes a scalar file onto dfs. When Pig needs the scalar, it opens the dfs file directly. Each scalar file contains more than one part file even though it contains only one record. This puts a huge load on the namenode. We should consolidate the part files before opening them. Another optional step is to put the consolidated file into the distributed cache. This further brings down the load on the namenode.
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Attachment: (was: PIG-1458.patch) Optimize scalar to consolidate the part file Key: PIG-1548 URL: https://issues.apache.org/jira/browse/PIG-1548 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1548.patch The current scalar implementation writes a scalar file onto dfs. When Pig needs the scalar, it opens the dfs file directly. Each scalar file contains more than one part file even though it contains only one record. This puts a huge load on the namenode. We should consolidate the part files before opening them. Another optional step is to put the consolidated file into the distributed cache. This further brings down the load on the namenode.
[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906008#action_12906008 ] Richard Ding commented on PIG-1543: --- +1. Looks good. IsEmpty returns the wrong value after using LIMIT - Key: PIG-1543 URL: https://issues.apache.org/jira/browse/PIG-1543 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Hu Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1543-1.patch
1. Two input files:
1a: limit_empty.input_a
1
1
1
1b: limit_empty.input_b
2
2
2. The pig script: limit_empty.pig
-- A contains only 1's, B contains only 2's
A = load 'limit_empty.input_a' as (a1:int);
B = load 'limit_empty.input_b' as (b1:int);
C = COGROUP A by a1, B by b1;
D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), COUNT(B);
store D into 'limit_empty.output/d';
-- After the script is done, we see the right results:
-- {(1),(1),(1)} {} 1 0 3 0
-- {} {(2),(2)} 0 1 0 2
C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
D1 = FOREACH C1 generate Alim, Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim);
store D1 into 'limit_empty.output/d1';
-- After the script is done, we see the unexpected results:
-- {(1)} {} 1 1 1 0
-- {} {(2)} 1 1 0 1
dump D;
dump D1;
3. Run the script and redirect the stdout (2 dumps) to a file. There are two issues:
The major one: IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while IsEmpty() returns the correct value in limit_empty.output/d/*. The difference is that LIMIT has been applied to one of them before using IsEmpty().
The minor one: The redirected output only contains the first dump:
({(1),(1),(1)},{},1,0,3L,0L)
({},{(2),(2)},0,1,0L,2L)
We expect two more lines like:
({(1)},{},1,1,1L,0L)
({},{(2)},1,1,0L,1L)
Besides, there is an error that says: [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple
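To make the expected behavior concrete, the Python sketch below models the script's semantics with plain lists: a bag that is empty before LIMIT should still be empty after it, so the IsEmpty flags for D1 should match those for D. This is an illustrative model only, not Pig code; `cogroup`, `limit`, `is_empty` and `d1_rows` are hypothetical helpers written for the example.

```python
def cogroup(a, b):
    """Group two lists of ints by value, yielding (key, bag_a, bag_b) per key."""
    keys = sorted(set(a) | set(b))
    return [(k, [x for x in a if x == k], [y for y in b if y == k]) for k in keys]


def limit(bag, n):
    """Keep at most n items; an empty bag stays empty."""
    return bag[:n]


def is_empty(bag):
    return len(bag) == 0


def d1_rows(a, b):
    """Model of D1: limit each bag to 1, then report emptiness flags and counts."""
    rows = []
    for _key, bag_a, bag_b in cogroup(a, b):
        alim, blim = limit(bag_a, 1), limit(bag_b, 1)
        rows.append((alim, blim,
                     0 if is_empty(alim) else 1,
                     0 if is_empty(blim) else 1,
                     len(alim), len(blim)))
    return rows
```

In this model the flags for the limited bags come out as 1 0 and 0 1, in contrast to the 1 1 and 1 1 that the report observes in limit_empty.output/d1.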
[jira] Resolved: (PIG-1599) pig gives generic message for few cases
[ https://issues.apache.org/jira/browse/PIG-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-1599. --- Hadoop Flags: [Reviewed] Resolution: Fixed Patch is committed to both trunk and 0.8 branch. Thanks Niraj. pig gives generic message for few cases --- Key: PIG-1599 URL: https://issues.apache.org/jira/browse/PIG-1599 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai Attachments: pig-1599_0.patch, pig-1599_1.patch
When we run the script:
register testudf.jar;
a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
b = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
c = cogroup a by name, b by name;
d = foreach c generate flatten(org.apache.pig.test.udf.evalfunc.BadUdf(a,b));
dump d;
we now get the error:
ERROR 2088: Unable to get results for: hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp1787360727/tmp509618997:org.apache.pig.impl.io.InterStorage.
The UDF is a bad UDF, so it should instead throw:
ERROR 2078: Caught error from UDF: org.apache.pig.test.udf.evalfunc.BadUdf, Out of bounds access [Index: 2, Size: 2]
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. Optimize scalar to consolidate the part file Key: PIG-1548 URL: https://issues.apache.org/jira/browse/PIG-1548 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1548.patch, PIG-1548_1.patch The current scalar implementation writes a scalar file onto dfs. When Pig needs the scalar, it opens the dfs file directly. Each scalar file contains more than one part file even though it contains only one record. This puts a huge load on the namenode. We should consolidate the part files before opening them. Another optional step is to put the consolidated file into the distributed cache. This further brings down the load on the namenode.
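The consolidation step described in this issue can be pictured as concatenating every part file of a scalar's output directory into a single file, so that each reader opens one file instead of many. The Python sketch below does this on the local filesystem purely as an analogy; it is not the committed patch and does not use the HDFS API, and `consolidate_parts`, `demo` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def consolidate_parts(scalar_dir: Path, target: Path) -> Path:
    """Concatenate all part-* files in scalar_dir, in name order, into target."""
    parts = sorted(p for p in scalar_dir.glob("part-*") if p.is_file())
    with target.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return target


def demo() -> str:
    """Two part files, only one of which holds the single scalar record."""
    with tempfile.TemporaryDirectory() as tmp:
        scalar_dir = Path(tmp) / "scalar"
        scalar_dir.mkdir()
        (scalar_dir / "part-00000").write_text("")
        (scalar_dir / "part-00001").write_text("42\n")
        merged = consolidate_parts(scalar_dir, Path(tmp) / "scalar.merged")
        return merged.read_text()
```

After consolidation, every task that needs the scalar touches one file; placing that file in the distributed cache, as the issue suggests, would reduce namenode requests further.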
[jira] Commented: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905744#action_12905744 ] Richard Ding commented on PIG-1334: --- Scott, Please create a new Jira for this. Another follow-up jira (PIG-1562) has already been opened. -Richard Make pig artifacts available through maven -- Key: PIG-1334 URL: https://issues.apache.org/jira/browse/PIG-1334 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: niraj rai Fix For: 0.8.0 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Attachment: PIG-1458.patch Results of test-patch:
{code}
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
{code}
Optimize scalar to consolidate the part file Key: PIG-1548 URL: https://issues.apache.org/jira/browse/PIG-1548 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1458.patch The current scalar implementation writes a scalar file onto dfs. When Pig needs the scalar, it opens the dfs file directly. Each scalar file contains more than one part file even though it contains only one record. This puts a huge load on the namenode. We should consolidate the part files before opening them. Another optional step is to put the consolidated file into the distributed cache. This further brings down the load on the namenode.
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904267#action_12904267 ] Richard Ding commented on PIG-1343: --- Patch is committed to the trunk. Thanks Niraj. pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: niraj rai Fix For: 0.8.0 Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch
There is a particular case where I was running with the latest trunk of Pig.
{code}
$ java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
[main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log
$ ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}
The job failed and the log file did not contain anything; the only way to debug was to look into the Jobtracker logs. Here are some reasons which would have caused this behavior:
1) The underlying filer/NFS had some issues. In that case do we not error on stdout?
2) There are some errors from the backend which are not being captured
Viraj
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1343: -- Attachment: PIG-1343_6.patch pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: niraj rai Fix For: 0.8.0 Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch There is a particular case where I was running with the latest trunk of Pig. {code} $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log $ls -l pig_1263420012601.log ls: pig_1263420012601.log: No such file or directory {code} The job failed and the log file did not contain anything, the only way to debug was to look into the Jobtracker logs. Here are some reasons which would have caused this behavior: 1) The underlying filer/NFS had some issues. In that case do we not error on stdout? 2) There are some errors from the backend which are not being captured Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1343: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: niraj rai Fix For: 0.8.0 Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch There is a particular case where I was running with the latest trunk of Pig. {code} $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log $ls -l pig_1263420012601.log ls: pig_1263420012601.log: No such file or directory {code} The job failed and the log file did not contain anything, the only way to debug was to look into the Jobtracker logs. Here are some reasons which would have caused this behavior: 1) The underlying filer/NFS had some issues. In that case do we not error on stdout? 2) There are some errors from the backend which are not being captured Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs
[ https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904321#action_12904321 ] Richard Ding commented on PIG-1570: --- +1. native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs - Key: PIG-1570 URL: https://issues.apache.org/jira/browse/PIG-1570 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: PIG-1570.1.patch The code path for handling failure in MR job corresponding to native MR is different and does not have the same behavior. For example, even if the MR job for mapreduce operator fails, the number of jobs that failed is being reported as 0 in PigStats log. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1458: -- Attachment: PIG-1458_1.patch New patch addressing review comments. aggregate files for replicated join --- Key: PIG-1458 URL: https://issues.apache.org/jira/browse/PIG-1458 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1458.patch, PIG-1458_1.patch We have noticed that if the smaller data in replicated join has many files, this puts unneeded burden on the name node. pre-aggregating the files can improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1569: -- Status: Patch Available (was: Open) java properties not honored in case of properties such as stop.on.failure - Key: PIG-1569 URL: https://issues.apache.org/jira/browse/PIG-1569 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1569.patch In org.apache.pig.Main , properties are being set to default value without checking if the java system properties have been set to something else. stop.on.failure, opt.multiquery, aggregate.warning are some properties that have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
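The PIG-1569 description above says that defaults are applied without first checking the Java system properties. A minimal sketch of the intended precedence follows. It is not the code from PIG-1569.patch: the class name PropertyDefaults, the helper applyDefault, and the default values are all hypothetical and chosen only for illustration.

```java
import java.util.Properties;

final class PropertyDefaults {
    /**
     * Resolves one property. An explicit JVM system property (-Dkey=value) wins,
     * then any value already present in props, and only then the default.
     */
    static void applyDefault(Properties props, String key, String defaultValue) {
        String fromSystem = System.getProperty(key);
        if (fromSystem != null) {
            // The user set the property on the command line; honor it.
            props.setProperty(key, fromSystem);
        } else if (props.getProperty(key) == null) {
            // Nothing was set anywhere; fall back to the default.
            props.setProperty(key, defaultValue);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder defaults for illustration only; not taken from Pig's source.
        applyDefault(props, "stop.on.failure", "false");
        applyDefault(props, "opt.multiquery", "true");
        applyDefault(props, "aggregate.warning", "true");
        props.list(System.out);
    }
}
```

Running the sketch with, for example, -Dstop.on.failure=true should list that property as true while the other two keep their placeholder defaults.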
[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1569: -- Attachment: PIG-1569.patch java properties not honored in case of properties such as stop.on.failure - Key: PIG-1569 URL: https://issues.apache.org/jira/browse/PIG-1569 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1569.patch In org.apache.pig.Main , properties are being set to default value without checking if the java system properties have been set to something else. stop.on.failure, opt.multiquery, aggregate.warning are some properties that have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904385#action_12904385 ] Richard Ding commented on PIG-1458: --- Koji, Please open a jira on increasing the replication factor of the replicated files. Now it uses the default replication factor. Thanks, -Richard aggregate files for replicated join --- Key: PIG-1458 URL: https://issues.apache.org/jira/browse/PIG-1458 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1458.patch, PIG-1458_1.patch We have noticed that if the smaller data in replicated join has many files, this puts unneeded burden on the name node. pre-aggregating the files can improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1569: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed java properties not honored in case of properties such as stop.on.failure - Key: PIG-1569 URL: https://issues.apache.org/jira/browse/PIG-1569 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1569.patch In org.apache.pig.Main , properties are being set to default value without checking if the java system properties have been set to something else. stop.on.failure, opt.multiquery, aggregate.warning are some properties that have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-1458. --- Hadoop Flags: [Reviewed] Resolution: Fixed aggregate files for replicated join --- Key: PIG-1458 URL: https://issues.apache.org/jira/browse/PIG-1458 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1458.patch, PIG-1458_1.patch We have noticed that if the smaller data in replicated join has many files, this puts unneeded burden on the name node. pre-aggregating the files can improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904451#action_12904451 ] Richard Ding commented on PIG-1458: --- Patch committed to trunk. aggregate files for replicated join --- Key: PIG-1458 URL: https://issues.apache.org/jira/browse/PIG-1458 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1458.patch, PIG-1458_1.patch We have noticed that if the smaller data in replicated join has many files, this puts unneeded burden on the name node. pre-aggregating the files can improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904453#action_12904453 ] Richard Ding commented on PIG-1483: --- Patch committed to trunk. [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch, PIG-1483_1.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' 
as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
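The last Pig example in the PIG-1483 description computes a script's running time as the latest finish time minus the earliest submit time across its jobs, divided by 1000. The same arithmetic can be sketched in plain Java. This assumes the history timestamps are in milliseconds (suggested by the division by 1000, but not stated in the issue); the class ScriptRuntime and its Job type are hypothetical and are not part of HadoopJobHistoryLoader.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ScriptRuntime {
    /** One job record; the fields mirror PIG_SCRIPT_ID, SUBMIT_TIME and FINISH_TIME. */
    static final class Job {
        final String scriptId;
        final long submitMillis;
        final long finishMillis;

        Job(String scriptId, long submitMillis, long finishMillis) {
            this.scriptId = scriptId;
            this.submitMillis = submitMillis;
            this.finishMillis = finishMillis;
        }
    }

    /** Returns (max finish - min submit) / 1000 per script, using integer division. */
    static Map<String, Long> runtimeSeconds(List<Job> jobs) {
        // For each script id, track {earliest submit, latest finish}.
        Map<String, long[]> spans = new LinkedHashMap<>();
        for (Job job : jobs) {
            long[] span = spans.get(job.scriptId);
            if (span == null) {
                spans.put(job.scriptId, new long[] {job.submitMillis, job.finishMillis});
            } else {
                span[0] = Math.min(span[0], job.submitMillis);
                span[1] = Math.max(span[1], job.finishMillis);
            }
        }
        Map<String, Long> seconds = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> e : spans.entrySet()) {
            seconds.put(e.getKey(), (e.getValue()[1] - e.getValue()[0]) / 1000);
        }
        return seconds;
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
                new Job("script-1", 1000L, 4000L),
                new Job("script-1", 3000L, 9000L),
                new Job("script-2", 0L, 2500L));
        System.out.println(runtimeSeconds(jobs));
    }
}
```

Note that the subtraction is parenthesized before the division; dividing only the minimum start time by 1000 would mix milliseconds and seconds.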
[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904456#action_12904456 ] Richard Ding commented on PIG-1557: --- Patch committed to trunk. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch, PIG-1557_1.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902952#action_12902952 ] Richard Ding commented on PIG-1564: --- Hi Andrew, HDataStorage is a thin layer on top of Hadoop FileSystem. Since moving its local mode to Hadoop local mode, Pig no longer needs this layer. We intend to remove it in the future. As for Pig reading data from one file system and writing it to another, this has been supported since Pig 0.7. -Richard add support for multiple filesystems Key: PIG-1564 URL: https://issues.apache.org/jira/browse/PIG-1564 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1564-1.patch Currently you can't run Pig scripts that read data from one file system and write it to another. Also, Grunt doesn't support CDing from one directory to another on different file systems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-1518. --- Hadoop Flags: [Reviewed] Resolution: Fixed Patch is committed to trunk. Thanks Yan. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
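The PIG-1518 description above asks for an input format that packs several small files into a single split so that fewer map tasks are created. The grouping idea can be sketched as a simple greedy pass over file sizes. This is only an illustration and is not the algorithm in the committed patch: the class SplitCombiner, the method combine, and the maxSplitSize parameter are hypothetical, and a real input format would also consider block locations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitCombiner {
    /**
     * Greedily groups file sizes, in input order, into combined splits.
     * A new split is started whenever adding the next file would push the
     * current split past maxSplitSize. A single file larger than the limit
     * still gets a split of its own.
     */
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six small files; without combining, each would get its own map task.
        List<Long> sizes = Arrays.asList(10L, 20L, 30L, 40L, 100L, 5L);
        List<List<Long>> splits = combine(sizes, 64L);
        System.out.println(splits.size() + " splits: " + splits);
    }
}
```

With the sizes in main and a limit of 64, the six files end up in four splits instead of six.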
[jira] Assigned: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1569: - Assignee: Richard Ding java properties not honored in case of properties such as stop.on.failure - Key: PIG-1569 URL: https://issues.apache.org/jira/browse/PIG-1569 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Richard Ding Fix For: 0.8.0 In org.apache.pig.Main , properties are being set to default value without checking if the java system properties have been set to something else. stop.on.failure, opt.multiquery, aggregate.warning are some properties that have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903072#action_12903072 ] Richard Ding commented on PIG-1343: --- The new patch logs NPE instead of the intended message: {code} [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null {code} pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: niraj rai Fix For: 0.8.0 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch There is a particular case where I was running with the latest trunk of Pig. {code} $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log $ls -l pig_1263420012601.log ls: pig_1263420012601.log: No such file or directory {code} The job failed and the log file did not contain anything, the only way to debug was to look into the Jobtracker logs. Here are some reasons which would have caused this behavior: 1) The underlying filer/NFS had some issues. In that case do we not error on stdout? 2) There are some errors from the backend which are not being captured Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1458: -- Attachment: PIG-1458.patch This patch uses the new multi-file-combiner (PIG-1518) to concatenate many small files for replicated join. This is based on the assumption that the total size of the replicated files should be small enough to fit into main memory. aggregate files for replicated join --- Key: PIG-1458 URL: https://issues.apache.org/jira/browse/PIG-1458 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1458.patch We have noticed that if the smaller data in replicated join has many files, this puts unneeded burden on the name node. pre-aggregating the files can improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901992#action_12901992 ] Richard Ding commented on PIG-1551: --- The typo is still there: {code} private static final Class<?> LONG_ARRAY_CLASS = new Long[0].getClass(); {code} It seems what you want is {code} private static final Class<?> LONG_ARRAY_CLASS = new long[0].getClass(); {code} so it's consistent with other array classes. This does raise a question about array parameters: the first form applies to methods like _amethod(Long[] nums)_, while the second supports methods like _amethod(long[] nums)_. And they are not interchangeable. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
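The point in the PIG-1551 comment above, that Long[] and long[] parameter types cannot stand in for each other, can be checked with standard Java reflection. The sketch below is not taken from the dynamic invoker patch; the class ArrayParamLookup and its methods sumPrimitive and sumBoxed are hypothetical stand-ins for the amethod examples in the comment.

```java
import java.lang.reflect.Method;

final class ArrayParamLookup {
    // Hypothetical target taking an array of primitives, like amethod(long[] nums).
    public static long sumPrimitive(long[] nums) {
        long total = 0;
        for (long n : nums) {
            total += n;
        }
        return total;
    }

    // Hypothetical target taking an array of boxed values, like amethod(Long[] nums).
    public static long sumBoxed(Long[] nums) {
        long total = 0;
        for (Long n : nums) {
            total += n;
        }
        return total;
    }

    /** Returns true if this class declares a method with exactly this parameter type. */
    static boolean hasMethod(String name, Class<?> paramType) {
        try {
            Method m = ArrayParamLookup.class.getDeclaredMethod(name, paramType);
            return m != null;
        } catch (NoSuchMethodException e) {
            // Reflection matches parameter types exactly; no boxing is applied to arrays.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasMethod("sumPrimitive", long[].class));
        System.out.println(hasMethod("sumPrimitive", Long[].class));
        System.out.println(hasMethod("sumBoxed", Long[].class));
        System.out.println(hasMethod("sumBoxed", long[].class));
    }
}
```

Because reflective lookup matches parameter types exactly, a constant built from new Long[0].getClass() can only find methods declared with Long[] parameters, which is why the choice between the two forms matters.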
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902030#action_12902030 ] Richard Ding commented on PIG-1343: --- The log file is created when running in batch mode, but not in interactive mode. pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: niraj rai Fix For: 0.8.0 Attachments: 1343.patch, PIG-1343-1.patch There is a particular case where I was running with the latest trunk of Pig. {code} $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log $ls -l pig_1263420012601.log ls: pig_1263420012601.log: No such file or directory {code} The job failed and the log file did not contain anything, the only way to debug was to look into the Jobtracker logs. Here are some reasons which would have caused this behavior: 1) The underlying filer/NFS had some issues. In that case do we not error on stdout? 2) There are some errors from the backend which are not being captured Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902042#action_12902042 ] Richard Ding commented on PIG-1551: --- +1. I'm fine with arrays of primitive types. I can't think of a Java method that uses an array of object Long as a parameter. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Attachment: PIG-1483_1.patch New patch adding unit test. [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch, PIG-1483_1.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as 
user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Status: Patch Available (was: Open) [piggybank] Add HadoopJobHistoryLoader to the piggybank --- Key: PIG-1483 URL: https://issues.apache.org/jira/browse/PIG-1483 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1483.patch, PIG-1483_1.patch PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects. The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. Here is an example that shows the intended usage: *Find all the jobs grouped by script and user:* {code} a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job; c = filter b by not (id is null); d = group c by (id, user); e = foreach d generate flatten(group), c.job; dump e; {code} A couple more examples: *Find scripts that use only the default parallelism:* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; {code} *Find the running time of each script (in seconds):* {code} a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as 
script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name); d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; dump d; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Attachment: PIG-1557_1.patch New patch adds a unit test. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch, PIG-1557_1.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Status: Patch Available (was: Open) Hadoop Flags: [Reviewed] couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch, PIG-1557_1.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Status: Resolved (was: Patch Available) Resolution: Fixed couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch, PIG-1557_1.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Release Note: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. (was: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. Also added a -R parameter which allows users to specify properties in key=value form on the command line.) Remove -R option. In 0.8 Pig supports generic parameters such as -Dkey=value. support jars and scripts in dfs --- Key: PIG-1505 URL: https://issues.apache.org/jira/browse/PIG-1505 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Fix For: 0.8.0 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, pig-jars-and-scripts-from-dfs-trunk-1.patch, pig-jars-and-scripts-from-dfs-trunk-2.patch, pig-jars-and-scripts-from-dfs-trunk.patch Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901600#action_12901600 ] Richard Ding commented on PIG-1518: --- +1. The patch looks good. A few minor points: * In PigSplit, the method add(InputSplit split) is not used and can be removed * In MapRedUtil, it would be better not to leave the debug verification code in the source code * In PigRecordReader, the code can be simplified if the initNextRecordReader() call is moved from the constructor to the initialize() method multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901656#action_12901656 ] Richard Ding commented on PIG-1551: --- In Invoker.java, there is a typo: {code} private static final Class<?> LONG_ARRAY_CLASS = new String[0].getClass(); {code} Also, in the unPrimitivize method, this code seems unnecessary: {code} } else if (klass.equals(DOUBLE_ARRAY_CLASS)) { return DOUBLE_ARRAY_CLASS; {code} Otherwise the patch looks good. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
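The typo pointed out in the PIG-1551 comment above can be checked in isolation. This is a minimal sketch, assuming the constant is meant to hold the class of `long[]`; that intent is inferred from the name and is not confirmed in the thread. Only `LONG_ARRAY_CLASS` comes from the quoted patch; `ArrayClassCheck` and `AS_IN_PATCH` are hypothetical names used for illustration.

```java
final class ArrayClassCheck {
    // As quoted from the patch: named for long[] but actually holds String[].
    static final Class<?> AS_IN_PATCH = new String[0].getClass();

    // Presumed intent (an assumption, not confirmed by the patch author).
    static final Class<?> LONG_ARRAY_CLASS = new long[0].getClass();

    public static void main(String[] args) {
        System.out.println(AS_IN_PATCH == String[].class);    // prints "true"
        System.out.println(LONG_ARRAY_CLASS == long[].class); // prints "true"
        System.out.println(AS_IN_PATCH == long[].class);      // prints "false"
    }
}
```

With the constant as quoted, a lookup keyed on the class of a `long[]` argument would never match it, which is presumably why the reviewer flagged the line.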
[jira] Created: (PIG-1560) Build target 'checkstyle' fails
Build target 'checkstyle' fails --- Key: PIG-1560 URL: https://issues.apache.org/jira/browse/PIG-1560 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Assignee: Giridharan Kesavan Fix For: 0.8.0 Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1560) Build target 'checkstyle' fails
[ https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1560: -- Description: Stack trace: {code} /trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} was: Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} Build target 'checkstyle' fails --- Key: PIG-1560 URL:
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Attachment: PIG-1557.patch The alias for load statement is missing. Add load alias to the alias list. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Fix Version/s: 0.8.0 couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900811#action_12900811 ] Richard Ding commented on PIG-1505: --- The results of test-patch: {code} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {code} I'll commit the patch after running unit tests. support jars and scripts in dfs --- Key: PIG-1505 URL: https://issues.apache.org/jira/browse/PIG-1505 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, pig-jars-and-scripts-from-dfs-trunk-1.patch, pig-jars-and-scripts-from-dfs-trunk-2.patch, pig-jars-and-scripts-from-dfs-trunk.patch Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Fix Version/s: 0.8.0 Affects Version/s: 0.7.0 support jars and scripts in dfs --- Key: PIG-1505 URL: https://issues.apache.org/jira/browse/PIG-1505 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Fix For: 0.8.0 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, pig-jars-and-scripts-from-dfs-trunk-1.patch, pig-jars-and-scripts-from-dfs-trunk-2.patch, pig-jars-and-scripts-from-dfs-trunk.patch Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1334: -- Hadoop Flags: [Reviewed] Release Note: ant mvn-install: To install the artifact to the local filesystem. ant mvn-deploy: To deploy snapshots to the Apache Nexus repo (looks for authentication in ~/.m2/settings.xml). ant mvn-deploy -Drepo=staging: To deploy artifacts for voting before release; this also requires authentication configured in ~/.m2/settings.xml. Deploying artifacts to the staging repository requires signing the artifacts with gpg keys; the mvn-deploy target takes care of signing the artifacts. When the mvn-deploy target is executed with -Drepo=staging, it asks for the gpg passphrase, which needs to be keyed in. Once the deployment is successful, to make the artifact available in the staging repository, log in to the staging repository and close the staging by right-clicking on the staged artifact at http://repository.apache.org was: ant mvn-install: To install the artifact to the local filesystem. ant mvn-deploy: To deploy snapshots to the Apache Nexus repo (looks for authentication in ~/.m2/settings.xml). ant mvn-deploy -Drepo=staging: To deploy artifacts for voting before release; this also requires authentication configured in ~/.m2/settings.xml. Deploying artifacts to the staging repository requires signing the artifacts with gpg keys; the mvn-deploy target takes care of signing the artifacts. When the mvn-deploy target is executed with -Drepo=staging, it asks for the gpg passphrase, which needs to be keyed in. Once the deployment is successful, to make the artifact available in the staging repository, log in to the staging repository and close the staging by right-clicking on the staged artifact at http://repository.apache.org With this patch I have already uploaded artifacts to the staging repository (only people with committer access would be able to view this, as the repository is not closed yet). Make pig artifacts available through maven -- Key: PIG-1334 URL: https://issues.apache.org/jira/browse/PIG-1334 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: niraj rai Fix For: 0.8.0 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1334: -- Status: Resolved (was: Patch Available) Resolution: Fixed The patch is committed to the trunk. Thanks Niraj for making this feature available. Make pig artifacts available through maven -- Key: PIG-1334 URL: https://issues.apache.org/jira/browse/PIG-1334 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: niraj rai Fix For: 0.8.0 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed All core tests passed. The patch is committed to the trunk. Thanks Andrew for contributing this feature! support jars and scripts in dfs --- Key: PIG-1505 URL: https://issues.apache.org/jira/browse/PIG-1505 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Fix For: 0.8.0 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, pig-jars-and-scripts-from-dfs-trunk-1.patch, pig-jars-and-scripts-from-dfs-trunk-2.patch, pig-jars-and-scripts-from-dfs-trunk.patch Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900376#action_12900376 ] Richard Ding commented on PIG-1514: --- Patch looks good. A couple of comments: * It would be better to refactor the graph manipulation code into a helper class so that the graph transformation routines (such as swap, insert, remove, replace, ...) can be shared by all rules. * Please remove tabs from the file. Migrate logical optimization rule: OpLimitOptimizer --- Key: PIG-1514 URL: https://issues.apache.org/jira/browse/PIG-1514 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1514-0.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900518#action_12900518 ] Richard Ding commented on PIG-1334: --- The new output is at https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/ Make pig artifacts available through maven -- Key: PIG-1334 URL: https://issues.apache.org/jira/browse/PIG-1334 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: niraj rai Fix For: 0.8.0 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1452: -- Status: Resolved (was: Patch Available) Resolution: Fixed to remove hadoop20.jar from lib and use hadoop from the apache maven repo. -- Key: PIG-1452 URL: https://issues.apache.org/jira/browse/PIG-1452 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.8.0 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Fix For: 0.8.0 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, PIG-1452V4.PATCH pig use ivy for dependency management. But still it uses hadoop20.jar from the lib folder. Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1497) Mandatory rule PartitionFilterOptimizer
[ https://issues.apache.org/jira/browse/PIG-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900100#action_12900100 ] Richard Ding commented on PIG-1497: --- Looks good. A few comments: In _PartitionFilterPushDown_: * In the _check_ method, why was the condition changed from {code} if(... || sucs.size() != 1 || ...) { {code} to {code} if(... || succeds.size() == 0 || ...) {code} * In the _transform_ method, the original code {code} // remove this filter from the plan mPlan.removeAndReconnect(loFilter); {code} is replaced by its own implementation. It seems better to also migrate _removeAndReconnect_ to the new _OperatorPlan_, since the logic there is more complicated (it keeps the order of connections). * The javadoc for the class isn't migrated. * Several variables (e.g. loadFunc, loLoad, loFilter, ...) are now used only within the _PartitionFilterPushDownTransformer_ class, so it would be better to declare them inside the transformer class. In addition, * Need to remove all the tabs from the files and replace them with 4 spaces. * Several unit tests now fail due to the dependency on other jiras. Mandatory rule PartitionFilterOptimizer --- Key: PIG-1497 URL: https://issues.apache.org/jira/browse/PIG-1497 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1497-0.patch Need to migrate PartitionFilterOptimizer to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
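The remark in the PIG-1497 review about keeping the order of connections can be illustrated with a toy graph. This is only a sketch under stated assumptions: it is not Pig's `OperatorPlan` API, the class `PlanSketch` and its methods are hypothetical, nodes are plain strings, and it handles only a node with exactly one predecessor and one successor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a plan graph; not Pig's OperatorPlan.
final class PlanSketch {
    private final Map<String, List<String>> succs = new HashMap<>();
    private final Map<String, List<String>> preds = new HashMap<>();

    void connect(String from, String to) {
        succs.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        preds.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    List<String> successors(String node) {
        return succs.getOrDefault(node, new ArrayList<>());
    }

    List<String> predecessors(String node) {
        return preds.getOrDefault(node, new ArrayList<>());
    }

    // Removes a node that has exactly one predecessor and one successor,
    // splicing the two together at the positions the removed node occupied.
    void removeAndReconnect(String node) {
        List<String> p = predecessors(node);
        List<String> s = successors(node);
        if (p.size() != 1 || s.size() != 1) {
            throw new IllegalArgumentException("sketch handles only 1-in/1-out nodes");
        }
        String pred = p.get(0);
        String succ = s.get(0);
        // Replace in place instead of disconnect-then-connect, so that the
        // neighbours' edge lists keep their original ordering.
        List<String> predSuccs = succs.get(pred);
        predSuccs.set(predSuccs.indexOf(node), succ);
        List<String> succPreds = preds.get(succ);
        succPreds.set(succPreds.indexOf(node), pred);
        succs.remove(node);
        preds.remove(node);
    }

    public static void main(String[] args) {
        PlanSketch plan = new PlanSketch();
        plan.connect("load", "filter");
        plan.connect("filter", "join");
        plan.connect("other", "join");
        plan.removeAndReconnect("filter");
        System.out.println(plan.predecessors("join")); // prints "[load, other]"
    }
}
```

A naive disconnect followed by a fresh connect would append `load` after `other` in the join's input list, which matters for operators whose inputs are positional. That appears to be the complication the review refers to.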
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1452: -- Attachment: PIG-1452V4.PATCH New patch fixing the contrib projects. to remove hadoop20.jar from lib and use hadoop from the apache maven repo. -- Key: PIG-1452 URL: https://issues.apache.org/jira/browse/PIG-1452 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.8.0 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Fix For: 0.8.0 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, PIG-1452V4.PATCH pig use ivy for dependency management. But still it uses hadoop20.jar from the lib folder. Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1452: -- Status: Open (was: Patch Available) to remove hadoop20.jar from lib and use hadoop from the apache maven repo. -- Key: PIG-1452 URL: https://issues.apache.org/jira/browse/PIG-1452 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.8.0 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Fix For: 0.8.0 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, PIG-1452V4.PATCH pig use ivy for dependency management. But still it uses hadoop20.jar from the lib folder. Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1452: -- Status: Patch Available (was: Open) to remove hadoop20.jar from lib and use hadoop from the apache maven repo. -- Key: PIG-1452 URL: https://issues.apache.org/jira/browse/PIG-1452 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.8.0 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Fix For: 0.8.0 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, PIG-1452V4.PATCH pig use ivy for dependency management. But still it uses hadoop20.jar from the lib folder. Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899631#action_12899631 ] Richard Ding commented on PIG-1452: --- The target buildJar-withouthadoop doesn't depend on hadoop20.jar so this change doesn't affect this target. to remove hadoop20.jar from lib and use hadoop from the apache maven repo. -- Key: PIG-1452 URL: https://issues.apache.org/jira/browse/PIG-1452 Project: Pig Issue Type: Improvement Components: build Affects Versions: 0.8.0 Reporter: Giridharan Kesavan Assignee: Giridharan Kesavan Fix For: 0.8.0 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, PIG-1452V4.PATCH pig use ivy for dependency management. But still it uses hadoop20.jar from the lib folder. Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1392: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed The parser bug is fixed, but the script then runs into another problem, which is tracked by PIG-1545. The workaround is to disable the secondary key optimization. The patch is committed to the trunk. Parser fails to recognize valid field - Key: PIG-1392 URL: https://issues.apache.org/jira/browse/PIG-1392 Project: Pig Issue Type: Bug Reporter: Ankur Assignee: niraj rai Fix For: 0.8.0 Attachments: nested_parser.patch Using this script below, parser fails to recognize a valid field in the relation and throws error A = LOAD '/tmp' as (a:int, b:chararray, c:int); B = GROUP A BY (a, b); C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ; The error thrown is 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: chararray),A: {a: int,b: chararray,c: int}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899003#action_12899003 ] Richard Ding commented on PIG-1392: --- Thanks Niraj for fixing this issue. Parser fails to recognize valid field - Key: PIG-1392 URL: https://issues.apache.org/jira/browse/PIG-1392 Project: Pig Issue Type: Bug Reporter: Ankur Assignee: niraj rai Fix For: 0.8.0 Attachments: nested_parser.patch Using this script below, parser fails to recognize a valid field in the relation and throws error A = LOAD '/tmp' as (a:int, b:chararray, c:int); B = GROUP A BY (a, b); C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ; The error thrown is 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: chararray),A: {a: int,b: chararray,c: int}} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899053#action_12899053 ] Richard Ding commented on PIG-1334: --- bq. 2. This jar is 11MB and includes a bunch of dependencies, many of which are optional: We should deploy _pig-0.8.0-SNAPSHOT-core.jar_ (which contains only Pig classes) instead of _pig-0.8.0-SNAPSHOT.jar_ (which also contains dependent jars). Make pig artifacts available through maven -- Key: PIG-1334 URL: https://issues.apache.org/jira/browse/PIG-1334 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: niraj rai Fix For: 0.8.0 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, mvn_pig_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1452:
------------------------------
Attachment: PIG-1452_3.patch

I resynced the patch with the trunk, and the size of pig.jar is now about 8M.

to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
---------------------------------------------------------------------------
Key: PIG-1452
URL: https://issues.apache.org/jira/browse/PIG-1452
Project: Pig
Issue Type: Improvement
Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
Fix For: 0.8.0
Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH

Pig uses Ivy for dependency management, but it still uses hadoop20.jar from the lib folder. Now that the hadoop-0.20.2 artifacts are available in the Maven repo, Pig should leverage Ivy for resolving/retrieving the Hadoop artifacts.
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:
------------------------------
Attachment: PIG-1541_1.patch

New patch to address the general case where the join key is a tuple.

FR Join shouldn't match null values
-----------------------------------
Key: PIG-1541
URL: https://issues.apache.org/jira/browse/PIG-1541
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1541.patch, PIG-1541_1.patch

Here is an example. Data input (tab-separated; the first field of the second row is empty):
{code}
1	1
	2
{code}
The script
{code}
a = load 'input';
b = load 'input';
c = join a by $0, b by $0 using 'repl';
dump c;
{code}
generates results that match null values:
{code}
(1,1,1,1)
(,2,,2)
{code}
The regular join, on the other hand, gives the correct results.
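The expected behavior can be illustrated outside Pig with a small Python sketch of a hash-based (replicated-style) inner join that skips null keys on both sides. This is not the code in the attached patch: the function name `replicated_join` is hypothetical, and treating a tuple key as null when any of its components is null is an assumption made here for illustration.

```python
def replicated_join(left, right, key=lambda row: row[0]):
    """Inner hash join in which null (None) keys never match.

    Illustrative sketch only; not Pig's implementation.
    """
    def is_null(k):
        # Assumption: a tuple key counts as null if any component is None.
        if isinstance(k, tuple):
            return any(part is None for part in k)
        return k is None

    # Build an in-memory hash table from the small (replicated) input,
    # leaving out rows whose key is null.
    table = {}
    for row in right:
        k = key(row)
        if is_null(k):
            continue
        table.setdefault(k, []).append(row)

    # Stream the large input and probe the table; null keys are skipped,
    # so they can never pair up with each other.
    joined = []
    for row in left:
        k = key(row)
        if is_null(k):
            continue
        for match in table.get(k, []):
            joined.append(row + match)
    return joined


# The two input rows from the example above: (1, 1) and (null, 2).
rows = [(1, 1), (None, 2)]
result = replicated_join(rows, rows)
tuple_key_result = replicated_join(rows, rows, key=lambda r: (r[0], r[1]))
```

With the example input, only the row with key 1 joins with itself; the row with the null key produces no output, which is what the regular join returns.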
[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898450#action_12898450 ]

Richard Ding commented on PIG-1448:
-----------------------------------
+1. Looks good.

Detach tuple from inner plans of physical operator
--------------------------------------------------
Key: PIG-1448
URL: https://issues.apache.org/jira/browse/PIG-1448
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: multi_oom_filt.pig, PIG-1448.1.patch

This is a follow-up on PIG-1446, which only addresses this general problem for a specific instance of For Each. In general, all the physical operators that can have inner plans are vulnerable to this. A few of them are POLocalRearrange, POFilter, POCollectedGroup, etc. We need to fix all of these.
[jira] Commented: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897866#action_12897866 ]

Richard Ding commented on PIG-1541:
-----------------------------------
Results of test-patch:
{code}
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
{code}

FR Join shouldn't match null values
-----------------------------------
Key: PIG-1541
URL: https://issues.apache.org/jira/browse/PIG-1541
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1541.patch

Here is an example. Data input (tab-separated; the first field of the second row is empty):
{code}
1	1
	2
{code}
The script
{code}
a = load 'input';
b = load 'input';
c = join a by $0, b by $0 using 'repl';
dump c;
{code}
generates results that match null values:
{code}
(1,1,1,1)
(,2,,2)
{code}
The regular join, on the other hand, gives the correct results.
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897451#action_12897451 ]

Richard Ding commented on PIG-1458:
-----------------------------------
The proposal is to run another map-reduce job to merge the small files before the replicated join. This additional job will be added to the MR plan at compile time.

We consider three cases of a replicated join:
# The right input is a map-only job and the input files exist at compile time.
# The right input is a map-only job and the input files do not exist at compile time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 3., if the number of reducers exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 2., if the flag specified in the property file (_pig.frjoin.merge.files.optimistic_) is false, a merge job is added between the right input job and the FR join job. The default value of this flag is false.

aggregate files for replicated join
-----------------------------------
Key: PIG-1458
URL: https://issues.apache.org/jira/browse/PIG-1458
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
Fix For: 0.8.0

We have noticed that if the smaller data in a replicated join has many files, this puts an unneeded burden on the name node. Pre-aggregating the files can improve the situation.
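The three cases in the proposal above can be summarized as a single decision function. The Python sketch below only illustrates the rule as stated in the comment and is not Pig's compiler code. The function and parameter names are hypothetical, and the threshold is a required argument because the proposal does not state a default value for it.

```python
def needs_merge_job(map_only, files_exist, count, threshold, optimistic=False):
    """Decide whether a merge job goes between the right input job and the FR join job.

    map_only    -- True if the right input is a map-only job
    files_exist -- True if the input files exist at compile time
    count       -- number of input files (case 1) or number of reducers (case 3)
    threshold   -- value of pig.frjoin.merge.files.threshold
    optimistic  -- value of pig.frjoin.merge.files.optimistic (default false per the proposal)
    """
    if map_only and not files_exist:
        # Case 2: the files cannot be counted at compile time,
        # so the decision falls back to the optimistic flag.
        return not optimistic
    # Case 1 (map-only, files exist) and case 3 (map-reduce):
    # merge only when the count exceeds the threshold.
    return count > threshold


case1 = needs_merge_job(map_only=True, files_exist=True, count=150, threshold=100)
case2 = needs_merge_job(map_only=True, files_exist=False, count=0, threshold=100)
case3 = needs_merge_job(map_only=False, files_exist=True, count=20, threshold=100)
```

In this sketch a count equal to the threshold does not trigger a merge, following the word "exceeds" in the proposal.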
[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable
[ https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-103:
-----------------------------
Tags: documentation

Shared Job /tmp location should be configurable
-----------------------------------------------
Key: PIG-103
URL: https://issues.apache.org/jira/browse/PIG-103
Project: Pig
Issue Type: Improvement
Components: impl
Environment: Partially shared file:// filesystem (eg NFS)
Reporter: Craig Macdonald
Assignee: niraj rai
Fix For: 0.8.0
Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch

Hello,

I'm investigating running pig in an environment where various parts of the file:// filesystem are available on all nodes. I can tell hadoop to use a file:// file system location for its default by setting fs.default.name=file://path/to/shared/folder

However, this creates issues for Pig, as Pig writes its job information in a folder that it assumes is on a shared FS (eg DFS). In this scenario, /tmp is not shared across machines. So either /tmp should be configurable, or Hadoop should tell you the actual full location set in fs.default.name?

A straightforward solution is to make /tmp/ a property in src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext).

Any suggestions of property names?
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897484#action_12897484 ]

Richard Ding commented on PIG-1458:
-----------------------------------
For 1. and 2. above, another approach is to do nothing and rely on MultiFileInputFormat (PIG-1518) to merge small files.

aggregate files for replicated join
-----------------------------------
Key: PIG-1458
URL: https://issues.apache.org/jira/browse/PIG-1458
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
Fix For: 0.8.0

We have noticed that if the smaller data in a replicated join has many files, this puts an unneeded burden on the name node. Pre-aggregating the files can improve the situation.