[jira] Commented: (PIG-1661) Add alternative search-provider to Pig site
[ https://issues.apache.org/jira/browse/PIG-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917246#action_12917246 ] Santhosh Srinivasan commented on PIG-1661: -- Sure, worth a try. Add alternative search-provider to Pig site --- Key: PIG-1661 URL: https://issues.apache.org/jira/browse/PIG-1661 Project: Pig Issue Type: Improvement Components: documentation Reporter: Alex Baranau Priority: Minor Attachments: PIG-1661.patch Use search-hadoop.com service to make available search in Pig sources, MLs, wiki, etc. This was initially proposed on user mailing list. The search service was already added in site's skin (common for all Hadoop related projects) via AVRO-626 so this issue is about enabling it for Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1344) PigStorage should be able to read back complex data containing delimiters created by PigStorage
PigStorage should be able to read back complex data containing delimiters created by PigStorage --- Key: PIG-1344 URL: https://issues.apache.org/jira/browse/PIG-1344 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Santhosh Srinivasan Assignee: Daniel Dai Fix For: 0.8.0 With Pig 0.7, the TextDataParser has been removed and the logic to parse complex data types has moved to Utf8StorageConverter. However, this does not handle the case where the complex data itself contains the delimiters ('{', '}', ',', '(', ')', '[', ']', '#'). Fixing this issue will make PigStorage self-contained and more usable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
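The round-trip problem PIG-1344 describes can be seen with a toy serializer. The following is an illustrative Python sketch, not Pig's actual Utf8StorageConverter: any format that writes complex values with bare structural delimiters and no escaping cannot tell a literal delimiter inside a chararray from a field separator on read-back.

```python
# Illustrative sketch (not Pig's code): a naive PigStorage-style
# round trip breaks when a chararray field contains a structural delimiter.

def serialize_tuple(fields):
    # Writes a tuple as '(' + ','-joined fields + ')', like PigStorage output.
    return "(" + ",".join(fields) + ")"

def parse_tuple(text):
    # Naive parser: strips parens and splits on every comma, with no way
    # to distinguish a literal ',' in the data from a field separator.
    return text[1:-1].split(",")

original = ["a,b", "c"]          # first field contains the delimiter
roundtrip = parse_tuple(serialize_tuple(original))

print(original)   # ['a,b', 'c']
print(roundtrip)  # ['a', 'b', 'c']  -- data is corrupted on read-back
```

Escaping or quoting delimiters on write is what would make the round trip lossless, which is the gist of the fix being requested.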
[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service
[ https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850342#action_12850342 ] Santhosh Srinivasan commented on PIG-1331: -- Jay, In PIG-823 there was a discussion around how Owl is different from Hive's metastore. Is that still true today? If not, can you elaborate on the key differences between the two systems? Thanks, Santhosh Owl Hadoop Table Management Service --- Key: PIG-1331 URL: https://issues.apache.org/jira/browse/PIG-1331 Project: Pig Issue Type: New Feature Reporter: Jay Tang This JIRA is a proposal to create a Hadoop table management service: Owl. Today, MapReduce and Pig applications interact directly with HDFS directories and files and must deal with low-level data management issues such as storage format, serialization/compression schemes, data layout, and efficient data access, often with different solutions. Owl aims to provide a standard way to address these issues and to abstract away the complexities of reading/writing huge amounts of data from/to HDFS. Owl has a data access API that is modeled after the traditional Hadoop InputFormat and a management API to manipulate Owl objects. This JIRA is related to PIG-823 (Hadoop Metadata Service), as Owl has an internal metadata store. Owl integrates with different storage modules, like Zebra, through a pluggable architecture. Initially, the proposal is to submit Owl as a Pig contrib project. Over time, it makes sense to move it to a Hadoop subproject. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1117) Pig reading hive columnar rc tables
[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798917#action_12798917 ] Santhosh Srinivasan commented on PIG-1117: -- +1 on making it part of the main piggybank. We should not be creating a separate directory just to handle Hive. Pig reading hive columnar rc tables --- Key: PIG-1117 URL: https://issues.apache.org/jira/browse/PIG-1117 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Gerrit Jansen van Vuuren Assignee: Gerrit Jansen van Vuuren Fix For: 0.7.0 Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, PIG-1117.patch, PIG-117-v.0.6.0.patch, PIG-117-v.0.7.0.patch I've coded a LoadFunc implementation that can read from Hive Columnar RC tables. This is needed for a project that I'm working on because all our data is stored using the Hive thrift-serialized Columnar RC format. I have looked at the piggybank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs. There are still some improvements to be done, like setting the number of mappers based on date partitioning. It's been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read. I would like to contribute the class to the piggybank; can you guide me in what I need to do? I've used Hive-specific classes to implement this; is it possible to add this to the piggybank build ivy for automatic download of the dependencies? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's
[ https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12775968#action_12775968 ] Santhosh Srinivasan commented on PIG-1065: -- The schema will then correspond to the prefix, as implemented today. For example, if the AS clause is defined for flatten($1), $1 flattens to 10 columns, and the AS clause has 3 columns, then the prefix is used and the remaining columns are left undefined. In-determinate behaviour of Union when there are 2 non-matching schema's Key: PIG-1065 URL: https://issues.apache.org/jira/browse/PIG-1065 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 I have a script which first does a union of these schemas and then does an ORDER BY of this result. {code} f1 = LOAD '1.txt' as (key:chararray, v:chararray); f2 = LOAD '2.txt' as (key:chararray); u0 = UNION f1, f2; describe u0; dump u0; u1 = ORDER u0 BY $0; dump u1; {code} When I run in Map Reduce mode I get the following result: $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig Schema for u0 unknown. 
(1,2) (2,3) (1) (2) org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias u1 at org.apache.pig.PigServer.openIterator(PigServer.java:475) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) When I run the same script in local mode I get a different result, as we know that local mode does not use any Hadoop Classes. $java -cp pig.jar org.apache.pig.Main -x local broken.pig Schema for u0 unknown (1,2) (1) (2,3) (2) (1,2) (1) (2,3) (2) Here are some questions 1) Why do we allow union if the schemas do not match 2) Should we not print an error message/warning so that the user knows that this is not allowed or he can get unexpected results? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
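The map-side type mismatch above boils down to ORDER BY receiving sort keys of two different runtime types, one per input branch of the UNION. A hypothetical Python analogue (not Pig code; the int/str mismatch stands in for NullableBytesWritable vs NullableText) shows the same failure mode: heterogeneous keys cannot be totally ordered.

```python
# Illustrative analogue: UNION of mismatched schemas feeds ORDER BY keys of
# different runtime types, just as Hadoop received a NullableText key where
# it expected NullableBytesWritable.

rows_f1 = [("1", "2"), ("2", "3")]  # f1-like branch: string keys
rows_f2 = [(1,), (2,)]              # f2-like branch: a different key type

union = rows_f1 + rows_f2
try:
    ordered = sorted(union, key=lambda t: t[0])  # ORDER u0 BY $0
except TypeError as err:
    print("sort failed:", err)  # mixed key types are not comparable
```

Local mode never shuffles by key, which is consistent with the report that the same script "succeeds" locally while the MapReduce sort fails.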
[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's
[ https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776098#action_12776098 ] Santhosh Srinivasan commented on PIG-1065: -- bq. Aliasing inside foreach is hugely useful for readability. Are you suggesting removing the ability to assign aliases inside a foreach, or just to change/assign schemas? For consistency, all relational operators should support the AS clause. Gradually, the per-column aliasing in foreach should be removed from the documentation, deprecated, and eventually removed. This is a long-term recommendation. In-determinate behaviour of Union when there are 2 non-matching schema's Key: PIG-1065 URL: https://issues.apache.org/jira/browse/PIG-1065
[jira] Commented: (PIG-1073) LogicalPlanCloner can't clone plan containing LOJoin
[ https://issues.apache.org/jira/browse/PIG-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774147#action_12774147 ] Santhosh Srinivasan commented on PIG-1073: -- If my memory serves me correctly, the logical plan cloning was implemented (by me) for cloning inner plans for foreach. As such, the top level plan cloning was never tested and some items are marked as TODO (see visit methods for LOLoad, LOStore and LOStream). If you want to use it as you mention in your test cases, then you need to add code for cloning the LOLoad, LOStore, LOStream and LOJoin operators. LogicalPlanCloner can't clone plan containing LOJoin Key: PIG-1073 URL: https://issues.apache.org/jira/browse/PIG-1073 Project: Pig Issue Type: Bug Components: impl Reporter: Ashutosh Chauhan Add the following test case in LogicalPlanBuilder.java: public void testLogicalPlanCloner() throws CloneNotSupportedException { LogicalPlan lp = buildPlan("C = join (load 'A') by $0, (load 'B') by $0;"); LogicalPlanCloner cloner = new LogicalPlanCloner(lp); cloner.getClonedPlan(); } and this fails with the following stacktrace: java.lang.NullPointerException at org.apache.pig.impl.logicalLayer.LOVisitor.visit(LOVisitor.java:171) at org.apache.pig.impl.logicalLayer.PlanSetter.visit(PlanSetter.java:63) at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:213) at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.getClonedPlan(LogicalPlanCloneHelper.java:73) at org.apache.pig.impl.logicalLayer.LogicalPlanCloner.getClonedPlan(LogicalPlanCloner.java:46) at 
org.apache.pig.test.TestLogicalPlanBuilder.testLogicalPlanCloneHelper(TestLogicalPlanBuilder.java:2110) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's
[ https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774153#action_12774153 ] Santhosh Srinivasan commented on PIG-1065: -- Answer to Question 1: Pig 1.0 had that syntax and it was retained for backward compatibility. Paolo suggested that for uniformity, the 'AS' clause for the load statements should be extended to all relational operators. Gradually, the column aliasing in the foreach should be removed from the documentation and eventually removed from the language. In-determinate behaviour of Union when there are 2 non-matching schema's Key: PIG-1065 URL: https://issues.apache.org/jira/browse/PIG-1065
[jira] Commented: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771287#action_12771287 ] Santhosh Srinivasan commented on PIG-1016: -- I am summarizing my understanding of the patch that has been submitted by hc busy. Root cause: PIG-880 changed the value type of maps in PigStorage from native Java types to DataByteArray. As a result of this change, parsing of complex types as map values was disabled. Proposed fix: Revert the changes made as part of PIG-880 to interpret map values as Java types. In addition, change the comparison method to check for the object type and call the appropriate compareTo method. The latter is required to work around the fact that the front-end assigns the value type to be DataByteArray whereas the backend sees the actual type (Integer, Long, Tuple, DataBag, etc.). Based on this understanding I have the following review comment(s). Index: src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBytesRawComparator.java === Can you explain the checks in the if and the else? Specifically, NullableBytesWritable is a subclass of PigNullableWritable. As a result, in the if part, the check for both o1 and o2 not being PigNullableWritable is confusing, as nbw1 and nbw2 are cast to NullableBytesWritable if o1 and o2 are not PigNullableWritable. {code}
+// findbugs is complaining about nulls. This check sequence will prevent nulls from being dereferenced.
+if (o1 != null && o2 != null) {
+
+  // In case the objects are comparable
+  if ((o1 instanceof NullableBytesWritable && o2 instanceof NullableBytesWritable) ||
+      !(o1 instanceof PigNullableWritable && o2 instanceof PigNullableWritable)) {
+
+    NullableBytesWritable nbw1 = (NullableBytesWritable)o1;
+    NullableBytesWritable nbw2 = (NullableBytesWritable)o2;
+
+    // If either are null, handle differently.
+    if (!nbw1.isNull() && !nbw2.isNull()) {
+      rc = ((DataByteArray)nbw1.getValueAsPigType()).compareTo((DataByteArray)nbw2.getValueAsPigType());
+    } else {
+      // For sorting purposes two nulls are equal.
+      if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+      else if (nbw1.isNull()) rc = -1;
+      else rc = 1;
+    }
+  } else {
+    // enter here only if both o1 and o2 are non-NullableBytesWritable PigNullableWritable's
+    PigNullableWritable nbw1 = (PigNullableWritable)o1;
+    PigNullableWritable nbw2 = (PigNullableWritable)o2;
+    // If either are null, handle differently.
+    if (!nbw1.isNull() && !nbw2.isNull()) {
+      rc = nbw1.compareTo(nbw2);
+    } else {
+      // For sorting purposes two nulls are equal.
+      if (nbw1.isNull() && nbw2.isNull()) rc = 0;
+      else if (nbw1.isNull()) rc = -1;
+      else rc = 1;
+    }
+  }
+} else {
+  if (o1 == null && o2 == null) { rc = 0; }
+  else if (o1 == null) { rc = -1; }
+  else { rc = 1; }
{code} Reading in map data seems broken Key: PIG-1016 URL: https://issues.apache.org/jira/browse/PIG-1016 Project: Pig Issue Type: Improvement Components: data Affects Versions: 0.4.0 Reporter: hc busy Fix For: 0.5.0 Attachments: PIG-1016.patch Hi, I'm trying to load a map that has a tuple for value. The read fails in 0.4.0 because of a misconfiguration in the parser, whereas almost all documentation states that the value of the map can be any type. I've attached a patch that allows us to read in complex objects as values, as documented. I've done simple verification of loading in maps with tuple/map values and writing them back out using LOAD and STORE. All seems to work fine. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
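The review above centers on the comparator's null-handling contract. A small Python sketch (illustrative, not the PigBytesRawComparator code) captures the rule the patch implements: two nulls compare equal, a null sorts before any non-null value, and otherwise the underlying values are compared.

```python
# Illustrative sketch of the null-ordering rule in the reviewed comparator:
# two nulls are equal, a null sorts before any non-null, otherwise compare
# the values themselves.
from functools import cmp_to_key

def null_aware_compare(a, b):
    if a is None and b is None:
        return 0          # for sorting purposes two nulls are equal
    if a is None:
        return -1         # null sorts first
    if b is None:
        return 1
    return (a > b) - (a < b)  # ordinary three-way comparison

data = [3, None, 1, None, 2]
print(sorted(data, key=cmp_to_key(null_aware_compare)))
# [None, None, 1, 2, 3]
```

The branch structure in the Java patch exists only because the raw bytes may deserialize to different PigNullableWritable subclasses; the null rule itself is identical in both branches.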
[jira] Commented: (PIG-1056) table can not be loaded after store
[ https://issues.apache.org/jira/browse/PIG-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12770743#action_12770743 ] Santhosh Srinivasan commented on PIG-1056: -- Do you have the right load statement? I don't see the using clause that specifies the zebra loader. table can not be loaded after store --- Key: PIG-1056 URL: https://issues.apache.org/jira/browse/PIG-1056 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang Pig Stack Trace --- ERROR 1018: Problem determining schema during load org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Problem determining schema during load at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1023) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967) at org.apache.pig.PigServer.registerQuery(PigServer.java:383) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem determining schema during load at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017) ... 8 more Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: Problem determining schema during load at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732) ... 
10 more Caused by: java.io.IOException: No table specified for input at org.apache.hadoop.zebra.pig.TableLoader.checkConf(TableLoader.java:238) at org.apache.hadoop.zebra.pig.TableLoader.determineSchema(TableLoader.java:258) at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148) ... 11 more ~ script: register /grid/0/dev/hadoopqa/hadoop/lib/zebra.jar; A = load 'filter.txt' as (name:chararray, age:int); B = filter A by age > 20; --dump B; store B into 'filter1' using org.apache.hadoop.zebra.pig.TableStorer('[name];[age]'); rec1 = load 'B' using org.apache.hadoop.zebra.pig.TableLoader(); dump rec1; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1012) FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class
[ https://issues.apache.org/jira/browse/PIG-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768368#action_12768368 ] Santhosh Srinivasan commented on PIG-1012: -- I just looked at the first patch. It was setting generate to true in TestMRCompiler.java It should be set to false in order to run the test case correctly. +++ test/org/apache/pig/test/TestMRCompiler.java -private boolean generate = false; +private boolean generate = true; FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class --- Key: PIG-1012 URL: https://issues.apache.org/jira/browse/PIG-1012 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Attachments: PIG-1012-2.patch, PIG-1012.patch SeClass org.apache.pig.backend.executionengine.PigSlice defines non-transient non-serializable instance field is SeClass org.apache.pig.backend.executionengine.PigSlice defines non-transient non-serializable instance field loader Sejava.util.zip.GZIPInputStream stored into non-transient field PigSlice.is Seorg.apache.pig.backend.datastorage.SeekableInputStream stored into non-transient field PigSlice.is Seorg.apache.tools.bzip2r.CBZip2InputStream stored into non-transient field PigSlice.is Seorg.apache.pig.builtin.PigStorage stored into non-transient field PigSlice.loader Seorg.apache.pig.backend.hadoop.DoubleWritable$Comparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigBagWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigCharArrayWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDBAWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDoubleWritableComparator 
implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigFloatWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigIntWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigLongWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigTupleWritableComparator implements Comparator but not Serializable Se org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigWritableComparator implements Comparator but not Serializable SeClass org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper defines non-transient non-serializable instance field nig SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LessThanExpr defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LTOrEqualToExpr defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.NotEqualToExpr defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast 
defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject defines non-transient non-serializable instance field bagIterator SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserComparisonFunc defines non-transient non-serializable instance field log SeClass org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc defines non-transient non-serializable instance field log
[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records
[ https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765779#action_12765779 ] Santhosh Srinivasan commented on PIG-1014: -- Another option is to change the implementation of COUNT to reflect the proposed semantics. If the underlying UDF is changed then the user should be notified via an information message. If the user checks the explain output then (s)he will notice COUNT_STAR and will be confused. Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records Key: PIG-1014 URL: https://issues.apache.org/jira/browse/PIG-1014 Project: Pig Issue Type: Bug Affects Versions: 0.4.0 Reporter: Pradeep Kamath -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records
[ https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765194#action_12765194 ] Santhosh Srinivasan commented on PIG-1014: -- Essentially, Pradeep is pointing out an issue in the implementation of COUNT. If that is the case then COUNT has to be fixed or the semantics of COUNT have to be documented to explain the current implementation. I would vote for fixing COUNT to have the correct semantics. Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records Key: PIG-1014 URL: https://issues.apache.org/jira/browse/PIG-1014
[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records
[ https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765357#action_12765357 ] Santhosh Srinivasan commented on PIG-1014: -- After a discussion with Pradeep, who also graciously ran SQL queries to verify semantics, we have the following proposal. The semantics of COUNT could be defined as: 1. COUNT( A ) is equivalent to COUNT( A.* ) and the result of COUNT( A ) will count null tuples in the relation 2. COUNT( A.$0 ) will not count null tuples in the relation 3. COUNT( A.($0, $1) ) is equivalent to COUNT( A1.* ) where A1 is the relation containing tuples with two columns and will exhibit the behavior of statement 1 OR 3. COUNT( A.($0, $1) ) is equivalent to COUNT( A1.* ) where A1 is the relation containing tuples with two columns and will exhibit the behavior of statement 2 Point 3 needs more discussion. Comments/thoughts/suggestions/anything else welcome. Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records Key: PIG-1014 URL: https://issues.apache.org/jira/browse/PIG-1014
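The distinction under discussion can be modeled in a few lines. This is an illustrative Python sketch, not Pig's builtin implementations: a COUNT restricted to one column skips tuples whose value in that column is null, while a COUNT_STAR-style count includes every tuple regardless of nullness.

```python
# Illustrative model of the proposed semantics (not Pig's builtins):
# COUNT(A.$0) skips tuples whose first field is null; COUNT_STAR(A)
# counts every tuple in the bag.

tuples = [(1, "a"), (None, "b"), (3, None), (None, None)]

def count_first_field(bag):
    # COUNT(A.$0): tuples with a null first field are not counted
    return sum(1 for t in bag if t[0] is not None)

def count_star(bag):
    # COUNT_STAR(A): every tuple counts, null fields or not
    return len(bag)

print(count_first_field(tuples))  # 2
print(count_star(tuples))         # 4
```

This mirrors the SQL convention the comment alludes to, where COUNT(col) ignores NULLs but COUNT(*) does not.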
[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records
[ https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764771#action_12764771 ] Santhosh Srinivasan commented on PIG-1014: -- Is Pig trying to guess the user's intent? What if the user wanted to count without nulls?
[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records
[ https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764792#action_12764792 ] Santhosh Srinivasan commented on PIG-1014: -- If the user wants to count all records regardless of nulls, then the user should use COUNT_STAR explicitly. One of the philosophies of Pig has been to allow users to do exactly what they want. Here, we are violating that philosophy, and we are also second-guessing the user's intention.
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764846#action_12764846 ] Santhosh Srinivasan commented on PIG-984: - Very quick comment. The parser has a log.info call that should be converted to log.debug: Index: src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt === +[USING ("collected" { +log.info("Using mapside"); PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data Key: PIG-984 URL: https://issues.apache.org/jira/browse/PIG-984 Project: Pig Issue Type: New Feature Reporter: Richard Ding Assignee: Richard Ding Attachments: PIG-984.patch, PIG-984_1.patch The general group by operation in Pig needs both mappers and reducers (the aggregation is done in the reducers). This incurs disk writes/reads between mappers and reducers. However, in the cases where the input data has the following properties: 1. The records with the same key are grouped together (e.g., the data is sorted by the keys). 2. The records with the same key are in the same mapper input. the group by operation can be performed in the mappers only, removing the overhead of the disk writes/reads. Alan proposed adding a hint to the group by clause like this one:
{code}
A = load 'input' using SomeLoader(...);
B = group A by $0 using mapside;
C = foreach B generate ...
{code}
The proposed addition of using mapside to group will be a map-side group operator that collects all records for a given key into a buffer. When it sees a key change, it will emit the key and the bag of records it had buffered. It will assume that all records for a given key are collected together, and thus there is no need to buffer across keys. It is expected that SomeLoader will be implemented by data systems such as Zebra to ensure the data emitted by the loader satisfies properties (1) and (2) above.
It will be the responsibility of the user (or the loader) to guarantee properties (1) and (2) before invoking the mapside hint for the group by clause; the Pig runtime can't check the input data for errors. For group by clauses with the mapside hint, Pig Latin will only support group by columns (including *), not group by expressions or group all.
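The buffering behavior described above can be sketched as follows (a simplified, hypothetical model, not the actual Pig operator; records are modeled as key/value pairs and are assumed to arrive with equal keys adjacent, per properties (1) and (2)):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the map-side group idea (not the actual Pig operator):
// records with equal keys arrive adjacent, so only one key's records
// are buffered at a time; the bag is emitted when the key changes.
public class MapSideGroup {
    // Each record is (key, value); the output is (key, bag of values).
    static List<Map.Entry<String, List<String>>> group(List<Map.Entry<String, String>> records) {
        List<Map.Entry<String, List<String>>> out = new ArrayList<>();
        String currentKey = null;
        List<String> bag = new ArrayList<>();
        for (Map.Entry<String, String> rec : records) {
            if (currentKey != null && !currentKey.equals(rec.getKey())) {
                out.add(new SimpleEntry<>(currentKey, bag)); // key changed: flush
                bag = new ArrayList<>();
            }
            currentKey = rec.getKey();
            bag.add(rec.getValue());
        }
        if (currentKey != null) {
            out.add(new SimpleEntry<>(currentKey, bag)); // flush the last key
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> in = new ArrayList<>();
        in.add(new SimpleEntry<>("a", "1"));
        in.add(new SimpleEntry<>("a", "2"));
        in.add(new SimpleEntry<>("b", "3"));
        System.out.println(group(in)); // prints [a=[1, 2], b=[3]]
    }
}
```

If property (1) or (2) is violated (equal keys not adjacent), this sketch would silently emit the same key more than once, which is exactly why the description leaves the guarantee to the user or the loader.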
[jira] Commented: (PIG-1014) Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all records are counted without considering nullness of the fields in the records
[ https://issues.apache.org/jira/browse/PIG-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764368#action_12764368 ] Santhosh Srinivasan commented on PIG-1014: -- When the semantics of COUNT were changed, I thought this was communicated to the users. What is the intention of this jira?
[jira] Commented: (PIG-995) Limit Optimizer throw exception ERROR 2156: Error while fixing projections
[ https://issues.apache.org/jira/browse/PIG-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764119#action_12764119 ] Santhosh Srinivasan commented on PIG-995: - Review comments: The initialization code is fine. However, the try/catch block is shared between the rebuildSchemas() and rebuildProjectionMaps() method invocations. This could lead to a misleading error message: specifically, if rebuildSchemas() throws an exception, the error message will indicate that rebuilding projection maps failed. Limit Optimizer throw exception ERROR 2156: Error while fixing projections Key: PIG-995 URL: https://issues.apache.org/jira/browse/PIG-995 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.6.0 Attachments: PIG-995-1.patch, PIG-995-2.patch, PIG-995-3.patch The following script fails: A = load '1.txt' AS (a0, a1, a2); B = order A by a1; C = limit B 10; D = foreach C generate $0; dump D; Error log: Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 2156: Error while fixing projections. Projection map of node to be replaced is null. at org.apache.pig.impl.logicalLayer.ProjectFixerUpper.visit(ProjectFixerUpper.java:138) at org.apache.pig.impl.logicalLayer.LOProject.visit(LOProject.java:408) at org.apache.pig.impl.logicalLayer.LOProject.visit(LOProject.java:58) at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:65) at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.impl.logicalLayer.LOForEach.rewire(LOForEach.java:761)
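The review point can be illustrated with a small, hypothetical sketch (the lambdas below merely stand in for the rebuildSchemas()/rebuildProjectionMaps() calls; this is not the actual Pig code): giving each call its own try/catch lets the error message name the step that actually failed.

```java
// Illustration of the review comment (hypothetical stand-ins for
// rebuildSchemas()/rebuildProjectionMaps()): a shared try/catch cannot
// tell which step threw, so each step gets its own block and message.
public class RebuildSteps {
    static String rebuild(Runnable rebuildSchemas, Runnable rebuildProjectionMaps) {
        try {
            rebuildSchemas.run();
        } catch (RuntimeException e) {
            return "Error while rebuilding schemas";         // names the failing step
        }
        try {
            rebuildProjectionMaps.run();
        } catch (RuntimeException e) {
            return "Error while rebuilding projection maps"; // names the failing step
        }
        return "ok";
    }

    public static void main(String[] args) {
        // A failure in the first step now reports schemas, not projection maps.
        System.out.println(rebuild(() -> { throw new RuntimeException(); }, () -> {}));
        // prints Error while rebuilding schemas
    }
}
```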
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761270#action_12761270 ] Santhosh Srinivasan commented on PIG-984: - bq. But this is in line with what we've done for joins, philosophically, semantically, and syntactically. Not exactly; with joins we are exposing different kinds of joins. Here we are exposing the underlying aspects of the framework (mapside). If there is a parallel framework that does not do map-reduce, then having mapside in the language is philosophically and semantically not correct.
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761028#action_12761028 ] Santhosh Srinivasan commented on PIG-984: - A couple of things: 1. I am concerned about extending the language to support features that can be handled internally. The scope of the language has not been defined, but the language continues to evolve. 2. I agree with Thejas' comment about allowing expressions that do not alter the property. Pig will not be able to check that, but that is no different from not being able to check whether the data is sorted.
[jira] Commented: (PIG-984) PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
[ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12761073#action_12761073 ] Santhosh Srinivasan commented on PIG-984: - bq. This is something that can be inferred by looking at the schema and distribution key. I understand wanting a manual handle to turn on the behavior while developing, but the production version of this can be done automatically (if distributed by and sorted on a subset of the group keys, apply the map-side group rule in the optimizer). +1. That's what I meant when I said: bq. 1. I am concerned about extending the language to support features that can be handled internally. The scope of the language has not been defined, but the language continues to evolve.
[jira] Commented: (PIG-955) Skewed join generates incorrect results
[ https://issues.apache.org/jira/browse/PIG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754349#action_12754349 ] Santhosh Srinivasan commented on PIG-955: - Hi Ying, How are Fragment Replicate Join and Skewed Join related, as you mention in your bug description? Also, skewed join has been part of trunk for more than a month now, yet your bug description states that Pig needs skewed join. Thanks, Santhosh Skewed join generates incorrect results - Key: PIG-955 URL: https://issues.apache.org/jira/browse/PIG-955 Project: Pig Issue Type: Improvement Reporter: Ying He Attachments: PIG-955.patch Fragmented replicated join has a few limitations: - One of the tables needs to be loaded into memory - Join is limited to two tables Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skew in the input records. Further, it adjusts the number of reducers depending on the key distribution. We need to implement skewed join in Pig.
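The histogram-based idea in the description can be sketched roughly as follows (an illustrative model only, not Pig's skewed join implementation; the key sample and capacity estimate are hypothetical):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the skewed-join planning idea (not Pig's implementation):
// build a histogram of key frequencies from a sample, then assign more
// reducers to keys whose record count exceeds one reducer's capacity.
public class SkewedKeyPlan {
    // Returns, per key, how many reducers that key's records should span.
    static Map<String, Integer> reducersPerKey(List<String> sampledKeys, int totalReducers) {
        Map<String, Integer> histogram = new HashMap<>();
        for (String k : sampledKeys) {
            histogram.merge(k, 1, Integer::sum); // count each key's frequency
        }
        // Crude capacity estimate: an even share of the sample per reducer.
        int perReducer = Math.max(1, sampledKeys.size() / totalReducers);
        Map<String, Integer> plan = new HashMap<>();
        for (Map.Entry<String, Integer> e : histogram.entrySet()) {
            // A key with more records than one reducer can hold is split
            // across ceil(count / perReducer) reducers.
            plan.put(e.getKey(), (e.getValue() + perReducer - 1) / perReducer);
        }
        return plan;
    }

    public static void main(String[] args) {
        // The heavily skewed key "a" is spread across several reducers.
        System.out.println(reducersPerKey(Arrays.asList("a", "a", "a", "a", "b"), 5));
    }
}
```

A skewed key that would overflow a single reducer is thus spread over several, which is the behavior the description attributes to skewed join.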
[jira] Commented: (PIG-922) Logical optimizer: push up project
[ https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747560#action_12747560 ] Santhosh Srinivasan commented on PIG-922: - For relational operators that require multiple inputs, the list will correspond to each of its inputs. If you look at getRequiredFields, the list is populated on a per-input basis. In the case of getRequiredInputs, I see that the use of the list is not consistent for LOJoin, LOUnion, LOCogroup, and LOCross. Logical optimizer: push up project -- Key: PIG-922 URL: https://issues.apache.org/jira/browse/PIG-922 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-922-p1_0.patch, PIG-922-p1_1.patch, PIG-922-p1_2.patch This is a continuation of [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add another rule to the logical optimizer: push up project, i.e., prune columns as early as possible.
[jira] Resolved: (PIG-561) Need to generate empty tuples and bags as a part of Pig Syntax
[ https://issues.apache.org/jira/browse/PIG-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan resolved PIG-561. - Resolution: Duplicate Duplicate of PIG-773 Need to generate empty tuples and bags as a part of Pig Syntax -- Key: PIG-561 URL: https://issues.apache.org/jira/browse/PIG-561 Project: Pig Issue Type: New Feature Affects Versions: 0.2.0 Reporter: Viraj Bhat There is sometimes a need to generate empty tuples and bags as a part of the Pig syntax rather than through UDFs:
{code}
a = load 'mydata.txt' using PigStorage();
b = foreach a generate ( ) as emptytuple;
c = foreach a generate { } as emptybag;
dump c;
{code}
[jira] Commented: (PIG-912) Rename/Add 'string' as a type in place of chararray - and deprecate (and probably eventually remove) the use of 'chararray'
[ https://issues.apache.org/jira/browse/PIG-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740357#action_12740357 ] Santhosh Srinivasan commented on PIG-912: - +1 Rename/Add 'string' as a type in place of chararray - and deprecate (and probably eventually remove) the use of 'chararray' --- Key: PIG-912 URL: https://issues.apache.org/jira/browse/PIG-912 Project: Pig Issue Type: Bug Reporter: Mridul Muralidharan The type 'chararray' in pig does not refer to an array of characters (char []) but rather to java.lang.String. This is inconsistent and confusing naming; additionally, it will be an interoperability issue with other systems which support schemas (Zebra among others). It would be good to have consistent naming across projects, while also having appropriate names for the various types. Since the use of 'chararray' is already widely deployed, it would be good to: a) Add a type 'string' (or equivalent) which is an alias for 'chararray'. Additionally, it is possible to envision these too (if deemed necessary - not a main requirement): b) Modify documentation and example scripts to use this new type. c) Emit warnings about chararray being deprecated.
[jira] Commented: (PIG-908) Need a way to correlate MR jobs with Pig statements
[ https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739147#action_12739147 ] Santhosh Srinivasan commented on PIG-908: - +1 This approach has been discussed but not documented. Need a way to correlate MR jobs with Pig statements --- Key: PIG-908 URL: https://issues.apache.org/jira/browse/PIG-908 Project: Pig Issue Type: Wish Reporter: Dmitriy V. Ryaboy Complex Pig scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging the resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script was responsible for it. Explain plans help, but even with an explain plan, a fair amount of effort (and sometimes experimentation) is required to correlate the failing MR job with the corresponding Pig Latin statements. This ticket is created to discuss approaches to alleviating this problem.
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: In Progress (was: Patch Available) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization: 1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group and Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them. 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:
{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern. Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}
It would change to be:
{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                // It matches the pattern. Now check if the transformer
                // approves as well.
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    sawMatch = true;
                    rule.transformer.transform(match);
                }
            }
        }
        // Not sure if 1000 is the right number of iterations; maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}
The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying about getting the plan right in one pass. For example, we might have a plan that looks like Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join, and swap foreach and filter), applied iteratively, we can get from the initial to the final plan without needing to understand the big picture of the entire plan. 3) Add three calls to OperatorPlan:
{code}
/**
 * Swap two operators in a plan. Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException { ... }

/**
 * Push one operator in front of another. This function is for use when
 * the first operator has multiple inputs. The caller can specify
 * which input of the first operator the second operator should be pushed to.
 * @param first
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: Patch Available (was: In Progress) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: Optimizer_Phase5.patch, OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch
[jira] Updated: (PIG-880) Order by is borken with complex fields
[ https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-880: Status: Open (was: Patch Available) Order by is borken with complex fields -- Key: PIG-880 URL: https://issues.apache.org/jira/browse/PIG-880 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, PIG-880.patch Pig script: a = load 'studentcomplextab10k' as (smap:map[],c2,c3); f = foreach a generate smap#'name', smap#'age', smap#'gpa'; s = order f by $0; store s into 'sc.out' Stack: Caused by: java.lang.ArrayStoreException at java.lang.System.arraycopy(Native Method) at java.util.Arrays.copyOf(Arrays.java:2763) at java.util.ArrayList.toArray(ArrayList.java:305) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96) ... 
5 more at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769) at org.apache.pig.PigServer.execute(PigServer.java:762) at org.apache.pig.PigServer.access$100(PigServer.java:91) at org.apache.pig.PigServer$Graph.execute(PigServer.java:933) at org.apache.pig.PigServer.executeBatch(PigServer.java:245) at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88) at org.apache.pig.Main.main(Main.java:389)
[jira] Work started: (PIG-880) Order by is broken with complex fields
[ https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on PIG-880 started by Santhosh Srinivasan. Order by is broken with complex fields -- Key: PIG-880 URL: https://issues.apache.org/jira/browse/PIG-880 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, PIG-880.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-880) Order by is broken with complex fields
[ https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-880: Status: Patch Available (was: In Progress) Order by is broken with complex fields -- Key: PIG-880 URL: https://issues.apache.org/jira/browse/PIG-880 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, PIG-880.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-898) TextDataParser does not handle delimiters from one complex type in another
[ https://issues.apache.org/jira/browse/PIG-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737319#action_12737319 ] Santhosh Srinivasan commented on PIG-898: - In addition, empty bags, tuples, constants, and nulls are not handled. TextDataParser does not handle delimiters from one complex type in another -- Key: PIG-898 URL: https://issues.apache.org/jira/browse/PIG-898 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.4.0 Reporter: Santhosh Srinivasan Priority: Minor Fix For: 0.4.0 Currently, TextDataParser does not handle delimiters of one complex type appearing in another. For example, key1(#value1} will not be parsed correctly. The production for strings matches any sequence of characters that does not contain any delimiters for the complex types. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-880) Order by is broken with complex fields
[ https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-880: Attachment: (was: PIG-880.patch) Order by is broken with complex fields -- Key: PIG-880 URL: https://issues.apache.org/jira/browse/PIG-880 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, PIG-880_1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-880) Order by is broken with complex fields
[ https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-880: Status: Patch Available (was: In Progress) Order by is broken with complex fields -- Key: PIG-880 URL: https://issues.apache.org/jira/browse/PIG-880 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, PIG-880_1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func
[ https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736990#action_12736990 ] Santhosh Srinivasan commented on PIG-889: - PigHadoopLogger implements the PigLogger interface. As part of the implementation it uses the Hadoop reporter for aggregating the warning messages. Pig can not access reporter of PigHadoopLog in Load Func Key: PIG-889 URL: https://issues.apache.org/jira/browse/PIG-889 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_889_Patch.txt I'd like to increment Counter in my own LoadFunc, but it will throw NullPointerException. It seems that the reporter is not initialized. I looked into this problem and find that it need to call PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-897) Pig should support counters
Pig should support counters --- Key: PIG-897 URL: https://issues.apache.org/jira/browse/PIG-897 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.4.0 Reporter: Santhosh Srinivasan Fix For: 0.4.0 Pig should support the use of counters. The use of the counters can possibly be via the script or via Java APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-880) Order by is broken with complex fields
[ https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan reassigned PIG-880: --- Assignee: Santhosh Srinivasan Order by is broken with complex fields -- Key: PIG-880 URL: https://issues.apache.org/jira/browse/PIG-880 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-880) Order by is broken with complex fields
[ https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-880: Status: Patch Available (was: Open) Order by is broken with complex fields -- Key: PIG-880 URL: https://issues.apache.org/jira/browse/PIG-880 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-880-bytearray-mapvalue-code-without-tests.patch, PIG-880.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736283#action_12736283 ] Santhosh Srinivasan commented on PIG-660: - The build.xml in the patch(es) has a reference to hadoop20.jar. The missing part is the hadoop20.jar that Pig can use to build its sources. Pig cannot use the hadoop20.jar coming from the Hadoop release. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking. 1. Hadoop should return objects instead of strings when exceptions are thrown 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-882) log level not propagated to loggers
[ https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736359#action_12736359 ] Santhosh Srinivasan commented on PIG-882: - Minor comment: Index: src/org/apache/pig/Main.java === Instead of printing the warning message to stdout, it should be printed to stderr.
{code}
+catch (IOException e)
+{
+    System.out.println("Warn: Cannot open log4j properties file, use default");
+}
{code}
The rest of the patch looks fine. log level not propagated to loggers Key: PIG-882 URL: https://issues.apache.org/jira/browse/PIG-882 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Attachments: PIG-882-1.patch, PIG-882-2.patch Pig accepts log level as a parameter. But the log level it captures is not set appropriately, so that loggers in different classes log at the specified level. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func
[ https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735129#action_12735129 ] Santhosh Srinivasan commented on PIG-889: - The issue here is the lack of support for counters within Pig. The intention of the warn method in the PigLogger interface was to allow sources within Pig and UDFs to aggregate warnings. Your use of the reporter within the logger is not supported. An implementation detail prevents the correct use of this interface for load functions. The Hadoop reporter object is provided in the getRecordReader, map and reduce calls. For load functions, Pig provides an interface and for UDFs, an abstract class. As a result, the logger instance cannot be initialized in the loaders till we decide to add a method to support it. Will having the code from PigMapBase.map() in PigInputFormat.getRecordReader() work for you?
{code}
PigHadoopLogger pigHadoopLogger = PigHadoopLogger.getInstance();
pigHadoopLogger.setAggregate(aggregateWarning);
pigHadoopLogger.setReporter(reporter);
PhysicalOperator.setPigLogger(pigHadoopLogger);
{code}
Note that this is a workaround for your situation. I would highly recommend that you move to the use of counters when they are supported. Pig can not access reporter of PigHadoopLog in Load Func Key: PIG-889 URL: https://issues.apache.org/jira/browse/PIG-889 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_889_Patch.txt I'd like to increment Counter in my own LoadFunc, but it will throw NullPointerException. It seems that the reporter is not initialized. I looked into this problem and find that it need to call PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-892) Make COUNT and AVG deal with nulls accordingly with SQL standard
[ https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734806#action_12734806 ] Santhosh Srinivasan commented on PIG-892: -
1. Index: src/org/apache/pig/builtin/FloatAvg.java === The size of 't' is not checked before t.get(0) in the method count
{code}
+if (t != null && t.get(0) != null)
+    cnt++;
+}
{code}
2. Index: src/org/apache/pig/builtin/IntAvg.java === Same comment as FloatAvg.java
3. Index: src/org/apache/pig/builtin/DoubleAvg.java === Same comment as FloatAvg.java
4. Index: src/org/apache/pig/builtin/AVG.java === Same comment as FloatAvg.java
5. Index: src/org/apache/pig/builtin/LongAvg.java === Same comment as FloatAvg.java
6. Index: src/org/apache/pig/builtin/COUNT_STAR.java === I am not sure about the naming convention here. None of the built-in functions have a special character in the class name. COUNTSTAR would be better than COUNT_STAR.
Make COUNT and AVG deal with nulls accordingly with SQL standard --- Key: PIG-892 URL: https://issues.apache.org/jira/browse/PIG-892 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.4.0 Attachments: PIG-892.patch, PIG-892_v2.patch both COUNT and AVG need to ignore nulls. Also add COUNT_STAR to match COUNT(*) in SQL -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
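The arity check the review asks for can be sketched as follows. This is a simplified model, not the FloatAvg source: tuples are represented as Object[] instead of org.apache.pig.data.Tuple, and the method name merely echoes the count helper under review.

```java
import java.util.Arrays;
import java.util.List;

public class NullSafeCount {
    // Counts only tuples that are non-null, non-empty, and whose first
    // field is non-null -- checking the tuple's size before reading
    // field 0, which is the check the review comment asks to add.
    static long count(List<Object[]> bag) {
        long cnt = 0;
        for (Object[] t : bag) {
            if (t != null && t.length > 0 && t[0] != null) {
                cnt++;
            }
        }
        return cnt;
    }

    public static void main(String[] args) {
        List<Object[]> bag = Arrays.asList(
            new Object[]{1.0f},   // counted
            new Object[]{2.0f},   // counted
            new Object[]{null},   // null field: ignored, per SQL semantics
            new Object[]{},       // empty tuple: the size check skips it
            null                  // null tuple: skipped
        );
        System.out.println(count(bag));  // prints "2"
    }
}
```

Without the length guard, the empty tuple would cause an out-of-bounds access before the null comparison ever runs, which is exactly why checking size before t.get(0) matters.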
[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported
[ https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734810#action_12734810 ] Santhosh Srinivasan commented on PIG-773: - +1 for the changes. Empty complex constants (empty bag, empty tuple and empty map) should be supported -- Key: PIG-773 URL: https://issues.apache.org/jira/browse/PIG-773 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Ashutosh Chauhan Priority: Minor Fix For: 0.4.0 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, pig-773_v4.patch, pig-773_v5.patch We should be able to create empty bag constant using {}, empty tuple constant using (), empty map constant using [] within a pig script -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported
[ https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-773: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch has been committed. Thanks for the fix Ashutosh. Empty complex constants (empty bag, empty tuple and empty map) should be supported -- Key: PIG-773 URL: https://issues.apache.org/jira/browse/PIG-773 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Ashutosh Chauhan Priority: Minor Fix For: 0.4.0 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, pig-773_v4.patch, pig-773_v5.patch We should be able to create empty bag constant using {}, empty tuple constant using (), empty map constant using [] within a pig script -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-892) Make COUNT and AVG deal with nulls accordingly with SQL standard
[ https://issues.apache.org/jira/browse/PIG-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734417#action_12734417 ] Santhosh Srinivasan commented on PIG-892: - I am reviewing the patch. Make COUNT and AVG deal with nulls accordingly with SQL standard --- Key: PIG-892 URL: https://issues.apache.org/jira/browse/PIG-892 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Assignee: Olga Natkovich Fix For: 0.4.0 Attachments: PIG-892.patch, PIG-892_v2.patch both COUNT and AVG need to ignore nulls. Also add COUNT_STAR to match COUNT(*) in SQL -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created
[ https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-695: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch has been committed. Pig should not fail when error logs cannot be created - Key: PIG-695 URL: https://issues.apache.org/jira/browse/PIG-695 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Attachments: PIG-695.patch Currently, PIG validates the log file location and fails/exits when the log file cannot be created. Instead, it should print a warning and continue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created
[ https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-695: Fix Version/s: 0.4.0 Pig should not fail when error logs cannot be created - Key: PIG-695 URL: https://issues.apache.org/jira/browse/PIG-695 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-695.patch Currently, PIG validates the log file location and fails/exits when the log file cannot be created. Instead, it should print a warning and continue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733773#action_12733773 ] Santhosh Srinivasan commented on PIG-893: - What are the semantics of casting chararray (string) to numeric types? Pig does not support conversion of any non-bytearray type to bytearray. The proposal in the jira description is minimalistic. Does it match that of SQL? Without clear articulation of what these conversions mean, we cannot/should not support chararray to numeric type conversions. PiggyBank already supports UDFs that convert strings to int, double, etc. It's nice to have as part of the language, but it's better positioned as a UDF. If clear semantics are laid out, then making it part of the language will be a matter of consensus. support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Reporter: Thejas M Nair Pig should support casting of chararray to integer,long,float,double,bytearray. If the conversion fails for reasons such as overflow, cast should return null and log a warning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
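The behavior the PIG-893 description asks for — a failed chararray-to-numeric cast yields null and logs a warning — can be sketched like this. The class and method names are illustrative; this is not the Utf8StorageConverter or any committed Pig code, just a model of the proposed semantics for the integer case.

```java
public class CharArrayCast {
    // Casts a chararray to Integer; on malformed input or overflow the
    // cast returns null and warns, rather than failing the script.
    static Integer castToInt(String chararray) {
        if (chararray == null) {
            return null;  // a null chararray stays null
        }
        try {
            return Integer.valueOf(chararray.trim());
        } catch (NumberFormatException e) {
            // Covers both non-numeric input and values outside int range,
            // since Integer.valueOf rejects overflowing literals too.
            System.err.println("WARN: could not cast '" + chararray + "' to int");
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(castToInt("42"));          // prints "42"
        System.out.println(castToInt("9999999999"));  // prints "null" (overflows int)
        System.out.println(castToInt("abc"));         // prints "null"
    }
}
```

The overflow case is the subtle one: "9999999999" is a perfectly good long, so the long and double variants of such a cast would accept inputs that the int variant must reject — one reason the comment above asks for the semantics to be spelled out before this goes into the language.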
[jira] Commented: (PIG-889) Pig can not access reporter of PigHadoopLog in Load Func
[ https://issues.apache.org/jira/browse/PIG-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733825#action_12733825 ] Santhosh Srinivasan commented on PIG-889: - Comments: The reporter inside the logger is set up correctly in PigInputFormat for Hadoop. However, the usage of the logger to retrieve the reporter and then increment counters is flawed for the following reasons: 1. In the test case, the new loader uses PigHadoopLogger directly. When the loader is used in local mode, the notion of Hadoop disappears and the reference to PigHadoopLogger is not usable (i.e., will result in a NullPointerException).
{code}
+ @Override
+ public Tuple getNext() throws IOException {
+     PigHadoopLogger.getInstance().getReporter().incrCounter(
+         MyCounter.TupleCounter, 1);
+     return super.getNext();
+ }
{code}
2. The loggers were meant for warning aggregations. Here, there is a case being made to expand the capabilities to allow user defined counter aggregations. If that's the case, then new methods have to be added to the PigLogger interface. Pig can not access reporter of PigHadoopLog in Load Func Key: PIG-889 URL: https://issues.apache.org/jira/browse/PIG-889 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_889_Patch.txt I'd like to increment Counter in my own LoadFunc, but it will throw NullPointerException. It seems that the reporter is not initialized. I looked into this problem and find that it need to call PigHadoopLogger.getInstance().setReporter(reporter) in PigInputFormat. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created
[ https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-695: Attachment: PIG-695.patch Attached patch ensures that Pig does not error out when the error log file is not writable. Pig should not fail when error logs cannot be created - Key: PIG-695 URL: https://issues.apache.org/jira/browse/PIG-695 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Attachments: PIG-695.patch Currently, PIG validates the log file location and fails/exits when the log file cannot be created. Instead, it should print a warning and continue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Work stopped: (PIG-695) Pig should not fail when error logs cannot be created
[ https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on PIG-695 stopped by Santhosh Srinivasan. Pig should not fail when error logs cannot be created - Key: PIG-695 URL: https://issues.apache.org/jira/browse/PIG-695 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Attachments: PIG-695.patch Currently, PIG validates the log file location and fails/exits when the log file cannot be created. Instead, it should print a warning and continue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-695) Pig should not fail when error logs cannot be created
[ https://issues.apache.org/jira/browse/PIG-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-695: Status: Patch Available (was: Open) Pig should not fail when error logs cannot be created - Key: PIG-695 URL: https://issues.apache.org/jira/browse/PIG-695 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Attachments: PIG-695.patch Currently, PIG validates the log file location and fails/exits when the log file cannot be created. Instead, it should print a warning and continue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages
[ https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-728: Status: In Progress (was: Patch Available) All backend error messages must be logged to preserve the original error messages - Key: PIG-728 URL: https://issues.apache.org/jira/browse/PIG-728 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Priority: Minor Fix For: 0.4.0 The current error handling framework logs backend error messages only when Pig is not able to parse the error message. Instead, Pig should log the backend error message irrespective of Pig's ability to parse backend error messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java is not consistent and should avoid the use of class_name + "(" + string_constructor_args + ")". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages
[ https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-728: Attachment: PIG-728_1.patch Attaching a new patch that fixes the findbugs issue. All backend error messages must be logged to preserve the original error messages - Key: PIG-728 URL: https://issues.apache.org/jira/browse/PIG-728 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Priority: Minor Fix For: 0.4.0 Attachments: PIG-728_1.patch The current error handling framework logs backend error messages only when Pig is not able to parse the error message. Instead, Pig should log the backend error message irrespective of Pig's ability to parse backend error messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java is not consistent and should avoid the use of class_name + "(" + string_constructor_args + ")". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages
[ https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-728: Attachment: PIG-728.patch The attached patch logs all backend error messages before Pig tries to parse the messages. In addition, the log format has been cleaned up to be more user-friendly. No new test cases have been added. All backend error messages must be logged to preserve the original error messages - Key: PIG-728 URL: https://issues.apache.org/jira/browse/PIG-728 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Priority: Minor Fix For: 0.4.0 Attachments: PIG-728.patch The current error handling framework logs backend error messages only when Pig is not able to parse the error message. Instead, Pig should log the backend error message irrespective of Pig's ability to parse backend error messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java is not consistent and should avoid the use of class_name + "(" + string_constructor_args + ")". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-728) All backend error messages must be logged to preserve the original error messages
[ https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-728: Fix Version/s: (was: 0.2.1) 0.4.0 Affects Version/s: (was: 0.2.1) 0.3.0 Status: Patch Available (was: Open) All backend error messages must be logged to preserve the original error messages - Key: PIG-728 URL: https://issues.apache.org/jira/browse/PIG-728 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Priority: Minor Fix For: 0.4.0 Attachments: PIG-728.patch The current error handling framework logs backend error messages only when Pig is not able to parse the error message. Instead, Pig should log the backend error message irrespective of Pig's ability to parse backend error messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java is not consistent and should avoid the use of class_name + "(" + string_constructor_args + ")". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach
[ https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730525#action_12730525 ] Santhosh Srinivasan commented on PIG-877: - It's at optimization time. Push up filter does not account for added columns in foreach Key: PIG-877 URL: https://issues.apache.org/jira/browse/PIG-877 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.3.1 Attachments: PIG-877.patch If a filter follows a foreach that produces an added column then push up filter fails with a null pointer exception. {code} ... x = foreach w generate $0, COUNT($1); y = filter x by $1 > 10; {code} In the above example, the column in the filter's expression is an added column. As a result, the optimizer rule is not able to map it back to the input resulting in a null value. The subsequent for loop is failing due to NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-877) Push up filter does not account for added columns in foreach
[ https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12730595#action_12730595 ] Santhosh Srinivasan commented on PIG-877: - Patch has been committed. Push up filter does not account for added columns in foreach Key: PIG-877 URL: https://issues.apache.org/jira/browse/PIG-877 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.3.1 Attachments: PIG-877.patch If a filter follows a foreach that produces an added column then push up filter fails with a null pointer exception. {code} ... x = foreach w generate $0, COUNT($1); y = filter x by $1 > 10; {code} In the above example, the column in the filter's expression is an added column. As a result, the optimizer rule is not able to map it back to the input resulting in a null value. The subsequent for loop is failing due to NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-877) Push up filter does not account for added columns in foreach
Push up filter does not account for added columns in foreach Key: PIG-877 URL: https://issues.apache.org/jira/browse/PIG-877 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Fix For: 0.3.1 If a filter follows a foreach that produces an added column then push up filter fails with a null pointer exception. {code} ... x = foreach w generate $0, COUNT($1); y = filter x by $1 > 10; {code} In the above example, the column in the filter's expression is an added column. As a result, the optimizer rule is not able to map it back to the input resulting in a null value. The subsequent for loop is failing due to NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-877) Push up filter does not account for added columns in foreach
[ https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan reassigned PIG-877: --- Assignee: Santhosh Srinivasan Push up filter does not account for added columns in foreach Key: PIG-877 URL: https://issues.apache.org/jira/browse/PIG-877 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.3.1 If a filter follows a foreach that produces an added column then push up filter fails with a null pointer exception. {code} ... x = foreach w generate $0, COUNT($1); y = filter x by $1 > 10; {code} In the above example, the column in the filter's expression is an added column. As a result, the optimizer rule is not able to map it back to the input resulting in a null value. The subsequent for loop is failing due to NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-877) Push up filter does not account for added columns in foreach
[ https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-877: Status: Patch Available (was: Open) Push up filter does not account for added columns in foreach Key: PIG-877 URL: https://issues.apache.org/jira/browse/PIG-877 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.3.1 Attachments: PIG-877.patch If a filter follows a foreach that produces an added column then push up filter fails with a null pointer exception. {code} ... x = foreach w generate $0, COUNT($1); y = filter x by $1 > 10; {code} In the above example, the column in the filter's expression is an added column. As a result, the optimizer rule is not able to map it back to the input resulting in a null value. The subsequent for loop is failing due to NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-877) Push up filter does not account for added columns in foreach
[ https://issues.apache.org/jira/browse/PIG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-877: Attachment: PIG-877.patch Attached patch fixes the NPE. Push up filter does not account for added columns in foreach Key: PIG-877 URL: https://issues.apache.org/jira/browse/PIG-877 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.3.1 Attachments: PIG-877.patch If a filter follows a foreach that produces an added column then push up filter fails with a null pointer exception. {code} ... x = foreach w generate $0, COUNT($1); y = filter x by $1 > 10; {code} In the above example, the column in the filter's expression is an added column. As a result, the optimizer rule is not able to map it back to the input resulting in a null value. The subsequent for loop is failing due to NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
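The mapping failure behind PIG-877 can be illustrated outside Pig. A foreach such as `generate $0, COUNT($1)` maps output column 0 back to input column 0, but output column 1 is an added column with no input counterpart; a push-up rule must treat that mapping as absent instead of dereferencing it. The sketch below is hypothetical (not Pig's actual optimizer classes), using a nullable entry to mark an added column:

```java
import java.util.Arrays;
import java.util.List;

class ColumnMapper {
    // mapping[i] = input column feeding output column i, or null for an
    // added column such as the result of COUNT($1)
    private final Integer[] mapping;

    ColumnMapper(Integer... mapping) { this.mapping = mapping; }

    /** A filter may be pushed above the foreach only if every column it
     *  references maps back to some input column. */
    boolean canPushFilter(List<Integer> filterColumns) {
        for (int col : filterColumns) {
            if (col >= mapping.length || mapping[col] == null) {
                return false; // added column: bail out instead of NPE
            }
        }
        return true;
    }
}
```

The fix described in the patch summary amounts to performing this null check before the rule's subsequent loop runs.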
[jira] Created: (PIG-873) Optimizer should allow search for global patterns
Optimizer should allow search for global patterns - Key: PIG-873 URL: https://issues.apache.org/jira/browse/PIG-873 Project: Pig Issue Type: Improvement Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Fix For: 0.4.0 Currently, the optimizer works on the following mechanism: 1. Specify the pattern to be searched 2. For each occurrence of the pattern, check and then apply a transformation With this approach, the search for a pattern is localized. An example will illustrate the problem. If the pattern to be searched for is foreach (with flatten) connected to any operator and if the graph has more than one foreach (with flatten) connected to an operator (cross, join, union, etc.), then each instance of foreach connected to the operator is returned as a match. While this is fine for a localized view (per match), at a global view the pattern to be searched for is any number of foreaches connected to an operator. The implication of not having a globalized view is more rules. There will be one rule for one foreach connected to an operator, one rule for two foreaches connected to an operator, etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-874) Problems in pushing down foreach with flatten
Problems in pushing down foreach with flatten - Key: PIG-874 URL: https://issues.apache.org/jira/browse/PIG-874 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Santhosh Srinivasan Fix For: 0.4.0 If the graph contains more than one foreach connected to an operator, pushing down foreach with flatten is not possible with the current optimizer pattern matching algorithm and current implementation of rewire. The following mechanism of pushing foreach with flatten does not work. 1. Search for foreach (with flatten) connected to an operator 2. If checks pass then unflatten the flattened column in the foreach 3. Create a new foreach that flattens the mapped column (the original column number could have changed) and insert the new foreach after the old foreach's successor. An example to illustrate the problem: {code} A = load 'myfile' as (name, age, gpa:(letter_grade, point_score)); B = foreach A generate $0, $1, flatten($2); C = load 'anotherfile' as (name, age, preference:(course_name, instructor)); D = foreach C generate $0, $1, flatten($2); E = join B by $0, D by $0 using replicated; F = limit E 10; {code} In the code snippet (see above), the optimizer will find two matches, B-E and D-E. For the first pattern match (B-E), $2 will be unflattened and a new foreach will be introduced after the join. {code} A = load 'myfile' as (name, age, gpa:(letter_grade, point_score)); B = foreach A generate $0, $1, $2; C = load 'anotherfile' as (name, age, preference:(course_name, instructor)); D = foreach C generate $0, $1, flatten($2); E = join B by $0, D by $0 using replicated; E1 = foreach E generate $0, $1, flatten($2), $3, $4, $5, $6; F = limit E1 10; {code} For the second match (D-E), the same transformation is applied. However, this transformation will not work for the following reason. The new foreach is now inserted between E and E1. When E1 is rewired, rewire is unable to map $6 in E1 as it never exists in E. 
In order to fix such situations, the pattern matching should return a global match instead of a local match. Reference: PIG-873 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
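The B-E/D-E example above suggests what a global match would look like: instead of returning each (foreach, successor) pair separately, the matcher groups all matching predecessors of one successor into a single match. A minimal sketch of that grouping step, with plain strings standing in for operators (hypothetical, not Pig's matcher API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GlobalMatcher {
    /** Each local match is a (predecessor, successor) pair; a global match
     *  groups all matching predecessors of one successor together, so the
     *  transformation sees every foreach feeding the join at once. */
    static Map<String, List<String>> globalize(List<String[]> localMatches) {
        Map<String, List<String>> bySuccessor = new LinkedHashMap<>();
        for (String[] match : localMatches) {
            bySuccessor.computeIfAbsent(match[1], k -> new ArrayList<>())
                       .add(match[0]);
        }
        return bySuccessor;
    }
}
```

With this grouping, the two local matches B-E and D-E surface as one global match {E: [B, D]}, so the rule can transform both foreaches in a single, consistent step.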
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726566#action_12726566 ] Santhosh Srinivasan commented on PIG-697: - 1. Removing added fields from the flattened set. The flattened set is the set of all flattened columns. It can contain mapped and added fields. In order to remove the added fields from this set, the removeAll method is used. 2. Comments on why the rule applies only to Order, Cross and Join Will add these comments. 3. Removing code in LOForEach for flattening a bag with unknown schema The code that I removed was redundant and also had a bug. The check for a field getting mapped was neglected in one case. After I added the check, the code for the if and the else was identical. I removed the redundant code and made it simpler. Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization: 1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group, Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. 
With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them. 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is: {code} public final void optimize() throws OptimizerException { RuleMatcher matcher = new RuleMatcher(); for (Rule rule : mRules) { if (matcher.match(rule)) { // It matches the pattern. Now check if the transformer // approves as well. List<List<O>> matches = matcher.getAllMatches(); for (List<O> match : matches) { if (rule.transformer.check(match)) { // The transformer approves. rule.transformer.transform(match); } } } } } {code} It would change to be: {code} public final void optimize() throws OptimizerException { RuleMatcher matcher = new RuleMatcher(); boolean sawMatch; int numIterations = 0; do { sawMatch = false; for (Rule rule : mRules) { List<List<O>> matches = matcher.getAllMatches(); for (List<O> match : matches) { // It matches the pattern. Now check if the transformer // approves as well. if (rule.transformer.check(match)) { // The transformer approves. sawMatch = true; rule.transformer.transform(match); } } } // Not sure if 1000 is the right number of iterations, maybe it // should be configurable so that large scripts don't stop too // early. } while (sawMatch && numIterations++ < 1000); } {code} The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying that we get the plan right in one pass. For example, we might have a plan that looks like: Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. 
With two simple rules (swap filter and join and swap foreach and filter), applied iteratively, we can get from the initial to final plan, without needing to understand the big picture of the entire plan.
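The fixed-point loop proposed above can be sketched in a self-contained form. This is an illustrative sketch, not the patch's code: `Rule` here is a hypothetical stand-in that reports whether it changed the plan, and the driver keeps making passes until no rule fires or a pass limit is hit, mirroring the do/while in the proposal.

```java
import java.util.List;

class FixedPointOptimizer {
    interface Rule<P> {
        /** Try to transform the plan; return true if anything changed. */
        boolean apply(P plan);
    }

    /** Apply all rules repeatedly until a full pass changes nothing or the
     *  iteration cap is reached (the cap guards against rule cycles). */
    static <P> int optimize(P plan, List<Rule<P>> rules, int maxIterations) {
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            for (Rule<P> rule : rules) {
                if (rule.apply(plan)) {
                    sawMatch = true;
                }
            }
            numIterations++;
        } while (sawMatch && numIterations < maxIterations);
        return numIterations; // number of passes made
    }
}
```

For example, a single "swap Join and Filter" rule applied to a Load-Join-Filter-Foreach plan converges in one productive pass plus one confirming pass, without the rule ever seeing the whole plan.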
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726601#action_12726601 ] Santhosh Srinivasan commented on PIG-697: - -1 javac. The applied patch generated 250 javac compiler warnings (more than the trunk's current 248 warnings). The additional 2 compiler warning messages are related to type inference. At this point these messages are harmless. -1 javac. The applied patch generated 250 javac compiler warnings (more than the trunk's current 248 warnings). Dodgy warning: The findbugs warnings are harmless, there is an explicit check for null to print null as opposed to the contents of the object. Correctness warning: There are checks in place to ensure that the variable can never be null. Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization: 1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group, Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. 
With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them. 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is: {code} public final void optimize() throws OptimizerException { RuleMatcher matcher = new RuleMatcher(); for (Rule rule : mRules) { if (matcher.match(rule)) { // It matches the pattern. Now check if the transformer // approves as well. List<List<O>> matches = matcher.getAllMatches(); for (List<O> match : matches) { if (rule.transformer.check(match)) { // The transformer approves. rule.transformer.transform(match); } } } } } {code} It would change to be: {code} public final void optimize() throws OptimizerException { RuleMatcher matcher = new RuleMatcher(); boolean sawMatch; int numIterations = 0; do { sawMatch = false; for (Rule rule : mRules) { List<List<O>> matches = matcher.getAllMatches(); for (List<O> match : matches) { // It matches the pattern. Now check if the transformer // approves as well. if (rule.transformer.check(match)) { // The transformer approves. sawMatch = true; rule.transformer.transform(match); } } } // Not sure if 1000 is the right number of iterations, maybe it // should be configurable so that large scripts don't stop too // early. } while (sawMatch && numIterations++ < 1000); } {code} The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying that we get the plan right in one pass. For example, we might have a plan that looks like: Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. 
With two simple rules (swap filter and join and swap foreach and filter), applied iteratively, we can get from the initial to final plan, without needing to understand the big picture of the entire plan. 3) Add three
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12726684#action_12726684 ] Santhosh Srinivasan commented on PIG-697: - Phase 4 part 2 patch has been committed. Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch, OptimizerPhase4_part2.patch I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization: 1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group, Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them. 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is: {code} public final void optimize() throws OptimizerException { RuleMatcher matcher = new RuleMatcher(); for (Rule rule : mRules) { if (matcher.match(rule)) { // It matches the pattern. 
Now check if the transformer // approves as well. List<List<O>> matches = matcher.getAllMatches(); for (List<O> match : matches) { if (rule.transformer.check(match)) { // The transformer approves. rule.transformer.transform(match); } } } } } {code} It would change to be: {code} public final void optimize() throws OptimizerException { RuleMatcher matcher = new RuleMatcher(); boolean sawMatch; int numIterations = 0; do { sawMatch = false; for (Rule rule : mRules) { List<List<O>> matches = matcher.getAllMatches(); for (List<O> match : matches) { // It matches the pattern. Now check if the transformer // approves as well. if (rule.transformer.check(match)) { // The transformer approves. sawMatch = true; rule.transformer.transform(match); } } } // Not sure if 1000 is the right number of iterations, maybe it // should be configurable so that large scripts don't stop too // early. } while (sawMatch && numIterations++ < 1000); } {code} The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying that we get the plan right in one pass. For example, we might have a plan that looks like: Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join and swap foreach and filter), applied iteratively, we can get from the initial to final plan, without needing to understand the big picture of the entire plan. 3) Add three calls to OperatorPlan: {code} /** * Swap two operators in a plan. Both of the operators must have single * inputs and single outputs. * @param first operator * @param second operator * @throws PlanException if either operator is not single input and output. */ public void swap(E first, E second) throws PlanException { ... } /** * Push one operator in front of another. 
This function is for use when * the first operator has multiple inputs. The caller can specify * which input of the first operator the second
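The proposed `swap` call can be sketched over a simple list-backed plan. This is only illustrative: Pig's OperatorPlan is a general graph, while the hypothetical `LinearPlan` below models the single-input/single-output precondition that `swap` requires by keeping operators in a straight line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinearPlan {
    final List<String> ops = new ArrayList<>();

    LinearPlan(String... operators) { ops.addAll(Arrays.asList(operators)); }

    /** Swap two operators in the plan. In a linear plan every interior
     *  operator has exactly one input and one output, matching swap's
     *  stated precondition. */
    void swap(String first, String second) {
        int i = ops.indexOf(first), j = ops.indexOf(second);
        if (i < 0 || j < 0) {
            throw new IllegalArgumentException("operator not in plan");
        }
        ops.set(i, second);
        ops.set(j, first);
    }
}
```

Applied to the Load-Join-Filter-Foreach example, one `swap(Join, Filter)` step already yields Load-Filter-Join-Foreach, which is how the simple iterative rules above make progress.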
[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-792: Status: Patch Available (was: Open) PERFORMANCE: Support skewed join in pig --- Key: PIG-792 URL: https://issues.apache.org/jira/browse/PIG-792 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Attachments: skewedjoin.patch Fragmented replicated join has a few limitations: - One of the tables needs to be loaded into memory - Join is limited to two tables Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution. We need to implement the skewed join in pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
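The histogram-driven reducer adjustment described for skewed join can be sketched as follows. This is a hypothetical illustration, not the algorithm in skewedjoin.patch: given a histogram of key frequencies from a sample, give each key ceil(count / capacity) reduce partitions, so a heavily skewed key is spread across several reducers while ordinary keys get one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SkewPartitioner {
    /** Number of reduce partitions to devote to each join key, given a
     *  histogram of key frequencies and a per-reducer row capacity. */
    static Map<String, Integer> partitionsPerKey(Map<String, Long> histogram,
                                                 long rowsPerReducer) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : histogram.entrySet()) {
            // ceil(count / capacity), with a minimum of one partition per key
            int parts = (int) ((e.getValue() + rowsPerReducer - 1) / rowsPerReducer);
            result.put(e.getKey(), Math.max(1, parts));
        }
        return result;
    }
}
```

Rows for a multi-partition key are then scattered across its partitions on the skewed side and replicated to all of them on the other side, which is the usual skewed-join trade-off.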
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725261#action_12725261 ] Santhosh Srinivasan commented on PIG-697: - Phase 4 part 1 patch has been committed. Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, OptimizerPhase4_part1-1.patch

I propose the following changes to Pig's optimizer, plan, and operator functionality to support more robust optimization:

1) Remove the required array from Rule. This changes rules so that they only match exact patterns instead of allowing missing elements in the pattern. The downside is that if a given rule applies to two patterns (say Load-Filter-Group and Load-Group), you have to write two rules. The upside is that the resulting rules know exactly what they are getting. The original intent of the required array was to reduce the number of rules that needed to be written, but the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule knows exactly which operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing it will have no effect on them.

2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern. Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                // It matches the pattern. Now check if the transformer
                // approves as well.
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    sawMatch = true;
                    rule.transformer.transform(match);
                }
            }
        }
        // Not sure if 1000 is the right number of iterations; maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying about getting the plan right in one pass. For example, we might have a plan that looks like Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With simple neighbor-swap rules (swap filter and join, swap foreach and filter), applied iteratively, we can get from the initial plan to the final plan without needing to understand the big picture of the entire plan.

3) Add three calls to OperatorPlan:

{code}
/**
 * Swap two operators in a plan. Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException { ... }

/**
 * Push one operator in front of another. This function is for use when
 * the first operator has multiple inputs. The caller can specify
 * which input of the first operator the second operator should be pushed to.
 *
{code}
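The fixpoint driver proposed in point 2 can be sketched with plain strings standing in for operators. This is a minimal illustration only: the class and method names below (FixpointSketch, applySwapRule, optimize) are invented for the sketch and are not Pig's actual Rule/RuleMatcher/Transformer machinery. Each "rule" swaps one adjacent operator pair, and the driver keeps re-running all rules until none fires or an iteration cap is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixpointSketch {

    // Swap every adjacent (first, second) pair found in the plan; report
    // whether this rule fired at least once during the pass.
    static boolean applySwapRule(List<String> plan, String first, String second) {
        boolean fired = false;
        for (int i = 0; i + 1 < plan.size(); i++) {
            if (plan.get(i).equals(first) && plan.get(i + 1).equals(second)) {
                plan.set(i, second);
                plan.set(i + 1, first);
                fired = true;
            }
        }
        return fired;
    }

    static List<String> optimize(List<String> plan) {
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            // Three local neighbor-swap rules; none of them understands the
            // whole plan, yet iterating lets their effects compose.
            // (|= rather than || so every rule runs on every pass.)
            sawMatch |= applySwapRule(plan, "Join", "Filter");
            sawMatch |= applySwapRule(plan, "Join", "Foreach");
            sawMatch |= applySwapRule(plan, "Filter", "Foreach");
            // The cap guards against rule pairs that undo each other forever.
        } while (sawMatch && ++numIterations < 1000);
        return plan;
    }

    public static void main(String[] args) {
        List<String> plan = new ArrayList<>(Arrays.asList("Load", "Join", "Filter", "Foreach"));
        System.out.println(optimize(plan)); // [Load, Foreach, Filter, Join]
    }
}
```

Starting from Load-Join-Filter-Foreach, the first pass applies three local swaps and the second pass finds nothing to do, reaching the Load-Foreach-Filter-Join plan the proposal describes without any rule ever seeing the whole picture.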
[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported
[ https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725279#action_12725279 ] Santhosh Srinivasan commented on PIG-773: - Review comments:

1. In addition to checking the type of the constant, the value should also be checked. The check on the data type is good; a check on the actual contents of the empty bag, empty tuple, and empty map will complete the testing.

{code}
+LOConst loConst = (LOConst)logOp;
+assertTrue(loConst.getType() == DataType.TUPLE);
+assertTrue(loConst.getValue() instanceof Tuple);
{code}

2. When you have a bag like {(), (1)}, the schema of this bag is returned as a bag that contains a tuple that has no schema. This might be the right approach for now, i.e., if a bag contains a tuple with no schema, then the schema of the bag will contain a tuple with no schema irrespective of the contents of the remaining tuples. This approach falls under the bigger question of how to handle unknown schemas in Pig. Since Alan is looking at this question for all of Pig, it would be good if he could review this part.

Empty complex constants (empty bag, empty tuple and empty map) should be supported -- Key: PIG-773 URL: https://issues.apache.org/jira/browse/PIG-773 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Pradeep Kamath Assignee: Ashutosh Chauhan Priority: Minor Fix For: 0.4.0 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch

We should be able to create an empty bag constant using {}, an empty tuple constant using (), and an empty map constant using [] within a Pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-866) Pig should support ability to query unique column name when there is no ambiguity
Pig should support ability to query unique column name when there is no ambiguity - Key: PIG-866 URL: https://issues.apache.org/jira/browse/PIG-866 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.4.0 Reporter: Santhosh Srinivasan Fix For: 0.4.0

Currently, the default alias of a column following a flatten contains the disambiguator ::. For columns that have a unique name, the :: disambiguator is not required. Although Pig supports column access via both the unique name and the disambiguated name, there is no support for retrieving the unique column name. This is a nice-to-have enhancement. The example below illustrates the issue:

{code}
grunt> a = load 'input' as (name, age, gpa);
grunt> b = group a ALL;
grunt> c = foreach b generate flatten(a);
grunt> describe c;
c: {a::name: bytearray,a::age: bytearray,a::gpa: bytearray}
grunt> d = foreach c generate name;
grunt> describe d;
d: {a::name: bytearray}
{code}

In the example shown above, although the column name 'name' is accepted in the relation 'd', the name of the column appears as 'a::name' in the schema. The workaround for this issue is to use the AS clause in the foreach. However, this is cumbersome for users, and it's something that can be fixed within Pig.
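The requested behavior can be illustrated with a small, hypothetical helper. Nothing here is Pig's API: AliasSimplifier, simplifyAliases, and bareName are invented names for the sketch. The rule it encodes is the one the issue asks for: drop the :: disambiguator exactly when the bare column name occurs only once across the schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AliasSimplifier {

    // Strip everything up to and including the last "::" ("a::name" -> "name");
    // aliases without a disambiguator are returned unchanged.
    static String bareName(String alias) {
        int i = alias.lastIndexOf("::");
        return i < 0 ? alias : alias.substring(i + 2);
    }

    public static List<String> simplifyAliases(List<String> aliases) {
        // Count how often each bare name appears across the schema.
        Map<String, Integer> counts = new HashMap<>();
        for (String alias : aliases) {
            counts.merge(bareName(alias), 1, Integer::sum);
        }
        // Drop the "::" disambiguator only where there is no ambiguity.
        List<String> result = new ArrayList<>();
        for (String alias : aliases) {
            String bare = bareName(alias);
            result.add(counts.get(bare) == 1 ? bare : alias);
        }
        return result;
    }
}
```

For the schema in the example above, {a::name, a::age, a::gpa} would simplify to {name, age, gpa}; if a join had also contributed a b::age, only the two age columns would keep their qualifiers.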
[jira] Commented: (PIG-866) Pig should support ability to query unique column name when there is no ambiguity
[ https://issues.apache.org/jira/browse/PIG-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725303#action_12725303 ] Santhosh Srinivasan commented on PIG-866: - This support has to be extended to the FieldSchema class when Java APIs are used to query the aliases.
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: Patch Available (was: In Progress)
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724704#action_12724704 ] Santhosh Srinivasan commented on PIG-697: - The FindBugs warnings are harmless; there are explicit checks for null in order to print null as opposed to the contents of the object.
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: In Progress (was: Patch Available)
[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported
[ https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722701#action_12722701 ] Santhosh Srinivasan commented on PIG-773: - Comments:

1. Minor comment - the comments on the empty productions all have the same text. For tuple and bag, they should be changed to say tuple and bag respectively.

{code}
+ | { } // Match the empty content in map.
{code}

2. I am not sure about the test case testEmptyBagConstRecursive. Here the bag contains an empty tuple. As a result, the field schema for the bag should contain the schema of the empty tuple. The test case will probably fail.

{code}
+@Test
+public void testEmptyBagConstRecursive() throws FrontendException {
+
+    LogicalPlan lp = buildPlan("a = foreach (load 'b') generate {()};");
+    LOForEach foreach = (LOForEach) lp.getLeaves().get(0);
+
+    Schema.FieldSchema bagFs = new Schema.FieldSchema(null, null, DataType.BAG);
+    Schema expectedSchema = new Schema(bagFs);
+
+    assertTrue(Schema.equals(foreach.getSchema(), expectedSchema, false, true));
+}
{code}

3. There are no tests that check if the empty constants are actually created, i.e., there are no checks for expected empty constants. The test below checks if the parser can parse the new syntax for empty constants. In addition, the values generated by the parser have to be checked against expected values for these constants.

{code}
+@Test
+public void testRandomEmptyConst() {
+    // Various random scripts to test the recursive nature of the parser with empty constants.
+
+    buildPlan("a = foreach (load 'b') generate {({})};");
+    buildPlan("a = foreach (load 'b') generate ({()});");
+    buildPlan("a = foreach (load 'b') generate {(),()};");
+    buildPlan("a = foreach (load 'b') generate ({},{});");
+    buildPlan("a = foreach (load 'b') generate ((),());");
+    buildPlan("a = foreach (load 'b') generate ([],[]);");
+    buildPlan("a = foreach (load 'b') generate {({},{})};");
+    buildPlan("a = foreach (load 'b') generate {([],[])};");
+    buildPlan("a = foreach (load 'b') generate (({},{}));");
+    buildPlan("a = foreach (load 'b') generate (([],[]));");
+}
{code}
[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-851: Status: In Progress (was: Patch Available) Map type used as return type in UDFs not recognized at all times Key: PIG-851 URL: https://issues.apache.org/jira/browse/PIG-851 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Attachments: patch_815.txt

When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in a run-time failure. An example script and UDF follow:

{code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {
    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    // Note that the outputSchema method is commented out
    /*
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
    */
{code}

{code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
2009-06-15 17:59:01,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
{code}
[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-851: Patch Info: [Patch Available]
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721975#action_12721975 ] Santhosh Srinivasan commented on PIG-697: - 1. Some operators do not have any internal state that requires rewiring. Examples of such operators include LOStream, LOCross, etc. 2. I think that the additional walking should be removed. I added a TODO as I was not sure why it was added in the first place. 3. Yes, it will be added as part of the next patch.
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722075#action_12722075 ] Santhosh Srinivasan commented on PIG-697: - OptimizerPhase3_part2_3.patch has been committed.
[jira] Commented: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721564#action_12721564 ] Santhosh Srinivasan commented on PIG-753: - +1 for the code changes. The license header and the unit tests that failed have to be checked. Provide support for UDFs without parameters --- Key: PIG-753 URL: https://issues.apache.org/jira/browse/PIG-753 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Jeff Zhang Attachments: Pig_753_Patch.txt Pig does not support UDFs without parameters; it forces me to provide one. For example, the following statement will generate an error: B = FOREACH A GENERATE bagGenerator(); I have to provide a parameter instead, as in: B = FOREACH A GENERATE bagGenerator($0); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
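What a no-argument UDF would look like can be sketched as follows; EvalFunc and Tuple here are tiny local stand-ins for Pig's interfaces (the real ones live in org.apache.pig), so only the shape is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class NoArgUdfSketch {
    // Local stand-ins for Pig's Tuple/EvalFunc (illustrative, not the real API).
    interface Tuple { int size(); Object get(int i); }

    abstract static class EvalFunc<T> {
        // With no-arg UDFs, exec must tolerate an empty (or null) input tuple.
        public abstract T exec(Tuple input);
    }

    // A UDF that needs no input: generates a constant "bag" of strings.
    static class BagGenerator extends EvalFunc<List<String>> {
        @Override
        public List<String> exec(Tuple input) {
            // Ignore the input entirely; a parameterless call would pass an empty tuple.
            List<String> bag = new ArrayList<>();
            bag.add("a");
            bag.add("b");
            return bag;
        }
    }

    public static void main(String[] args) {
        Tuple empty = new Tuple() {
            public int size() { return 0; }
            public Object get(int i) { throw new IndexOutOfBoundsException(); }
        };
        System.out.println(new BagGenerator().exec(empty)); // [a, b]
    }
}
```

The workaround in the report (`bagGenerator($0)`) forces exactly this kind of UDF to accept and discard a dummy field.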
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721592#action_12721592 ] Santhosh Srinivasan commented on PIG-856: - Would that be through a configuration parameter? What would be the default, 1 or 2? PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs, which is 3. Given the temporary nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
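If this did become a configuration parameter, it would presumably be read with a conservative default. A minimal sketch, assuming a hypothetical property name (pig.temp.replication is invented here, not an actual Pig setting):

```java
import java.util.Properties;

public class TempReplication {
    // Replication factor for Pig's intermediate (temp) data between MR jobs.
    // A default of 2 balances durability of short-lived data against name-node load.
    static short tempReplication(Properties conf) {
        String value = conf.getProperty("pig.temp.replication", "2");
        short replication = Short.parseShort(value);
        if (replication < 1) {
            throw new IllegalArgumentException("replication must be >= 1: " + replication);
        }
        return replication;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(tempReplication(conf)); // default: 2
        conf.setProperty("pig.temp.replication", "1");
        System.out.println(tempReplication(conf)); // overridden: 1
    }
}
```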
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721596#action_12721596 ] Santhosh Srinivasan commented on PIG-856: - Essentially, are we adding more knobs to tune Pig? We should document these knobs and explain how they interact with each other. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Attachment: (was: OptimizerPhase3_part2_2.patch) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: Patch Available (was: In Progress) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: In Progress (was: Patch Available) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: Patch Available (was: In Progress) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_1.patch
[jira] Commented: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720292#action_12720292 ] Santhosh Srinivasan commented on PIG-851: - Review comments:
1. The new sources test/org/apache/pig/test/utils/MyUDFReturnMap.java and test/org/apache/pig/test/TestUDFReturnMap.java need to include the Apache license headers.
2. The use of package sun.reflect.generics.reflectiveObjects.ParameterizedTypeImpl is resulting in 3 compiler warnings and 1 javadoc warning. Can we use a different package?
3. The test case in TestUDFReturnMap runs the test in local mode (i.e., ExecType.LOCAL). Another test for map reduce mode, ExecType.MAPREDUCE, should be added.
Map type used as return type in UDFs not recognized at all times Key: PIG-851 URL: https://issues.apache.org/jira/browse/PIG-851 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Fix For: 0.3.0 Attachments: Pig_815_patch.txt When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in a run time failure. An example UDF and script follow:

{code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {
    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    // Note that the outputSchema method is commented out
    /*
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
    */
}
{code}

{code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
2009-06-15 17:59:01,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
{code}
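The workaround the issue implies is to override outputSchema so the front end learns the return type. A self-contained sketch of that shape follows; Schema, FieldSchema, and DataType are pared-down stand-ins for the real org.apache.pig classes (whose constructors and constants differ), kept only so the override is visible:

```java
public class MapUdfSchemaSketch {
    // Pared-down stand-ins for Pig's Schema/FieldSchema/DataType (illustrative only).
    static final class DataType {
        static final byte UNKNOWN = 0;
        static final byte MAP = 100; // the real constant lives in org.apache.pig.data.DataType
    }

    static final class FieldSchema {
        final String alias;
        final byte type;
        FieldSchema(String alias, byte type) { this.alias = alias; this.type = type; }
    }

    static final class Schema {
        final FieldSchema field;
        Schema(FieldSchema field) { this.field = field; }
    }

    // What the UDF should declare: without this override the type stays UNKNOWN
    // and the script fails at run time with ERROR 2080.
    static Schema outputSchema(Schema input) {
        return new Schema(new FieldSchema(null, DataType.MAP));
    }

    public static void main(String[] args) {
        Schema out = outputSchema(null);
        System.out.println(out.field.type == DataType.MAP); // true
    }
}
```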
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: In Progress (was: Patch Available) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Status: Patch Available (was: In Progress) Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_2.patch
[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Srinivasan updated PIG-697: Attachment: OptimizerPhase3_part2_2.patch Attached patch fixes the findbug warning, and cleans up the sources by removing commented out code. The additional 35 compiler warning messages are related to type inference. At this point these messages are harmless. Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_2.patch I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization: 1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group, Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them. 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. 
Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern. Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                // It matches the pattern. Now check if the transformer
                // approves as well.
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    sawMatch = true;
                    rule.transformer.transform(match);
                }
            }
        }
        // Not sure if 1000 is the right number of iterations, maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying that we get the plan right in one pass. For example, we might have a plan that looks like: Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join, and swap foreach and filter), applied iteratively, we can get from the initial to the final plan, without needing to understand the big picture of the entire plan. 3) Add three calls to OperatorPlan:

{code}
/**
 * Swap two operators in a plan. Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException { ... }

/**
 * Push one operator in front of another. This function is for use when
 * the first
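The iterate-until-fixpoint idea in item 2 can be sketched independently of Pig's optimizer classes. The following is a minimal, hypothetical Java version (class and rule names are invented for illustration, not Pig's actual API) that repeatedly applies adjacent-swap rules to a list of operator names until a full pass fires no rule or an iteration cap is reached. Note that in this strict adjacent-swap model a third rule (Join/Foreach) is needed in addition to the two mentioned in the example, to carry Foreach past Join:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed optimize() loop: keep applying simple rules until a
// full pass makes no change, with a cap to guard against infinite loops.
class FixpointSketch {
    // A "rule" here is an ordered pair: if rule[0] appears directly before
    // rule[1], swap them (e.g. push Filter above Join).
    static boolean applySwapRule(List<String> plan, String first, String second) {
        boolean changed = false;
        for (int i = 0; i + 1 < plan.size(); i++) {
            if (plan.get(i).equals(first) && plan.get(i + 1).equals(second)) {
                plan.set(i, second);
                plan.set(i + 1, first);
                changed = true;
            }
        }
        return changed;
    }

    static List<String> optimize(List<String> plan) {
        List<String> result = new ArrayList<>(plan);
        // Swap Join/Filter and Filter/Foreach as in the JIRA example, plus
        // Join/Foreach so Foreach can also move past Join directly.
        String[][] rules = { {"Join", "Filter"}, {"Join", "Foreach"}, {"Filter", "Foreach"} };
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            for (String[] rule : rules) {
                sawMatch |= applySwapRule(result, rule[0], rule[1]);
            }
        } while (sawMatch && numIterations++ < 1000);
        return result;
    }
}
```

Each rule only has to know about two neighboring operators; the fixpoint loop composes them into the whole-plan rewrite.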
[jira] Commented: (PIG-728) All backend error messages must be logged to preserve the original error messages
[ https://issues.apache.org/jira/browse/PIG-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719617#action_12719617 ] Santhosh Srinivasan commented on PIG-728: - In addition, when the framework is not able to parse the error message, the message should be annotated as such. Extraneous details like "Unable to recreate exception", "Cannot create exception from empty string", etc. should not be communicated to the user. These messages reflect the internal workings of the error handling framework and do not add value for the user. All backend error messages must be logged to preserve the original error messages - Key: PIG-728 URL: https://issues.apache.org/jira/browse/PIG-728 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Priority: Minor Fix For: 0.2.1 The current error handling framework logs backend error messages only when Pig is not able to parse the error message. Instead, Pig should log the backend error message irrespective of Pig's ability to parse backend error messages. On a side note, the use of instantiateFuncFromSpec in Launcher.java is not consistent and should avoid the use of class_name + "(" + string_constructor_args + ")". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
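The annotation point above can be sketched as a small, hypothetical helper (not Pig's actual Launcher code): the raw backend message is always preserved, and a parse failure is annotated rather than leaking framework internals to the user:

```java
// Hypothetical sketch of the desired behavior: keep the raw backend message;
// if the error-handling framework cannot parse it, annotate it as unparsed
// instead of emitting internal messages such as "Unable to recreate exception".
class BackendErrorSketch {
    static String userMessage(String rawBackendMessage, boolean parseSucceeded) {
        if (parseSucceeded) {
            return rawBackendMessage;
        }
        // Annotate, rather than exposing the framework's parsing internals.
        return "Backend error (message could not be parsed): " + rawBackendMessage;
    }
}
```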
[jira] Commented: (PIG-605) Better explain and console output
[ https://issues.apache.org/jira/browse/PIG-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719731#action_12719731 ] Santhosh Srinivasan commented on PIG-605: - In addition, it will be very useful for users if the plans have the line numbers of the pig script that resulted in the final plan. For example, the plan should state "Line number 10, 12, 14" to help users work backwards from the plan to the original script. Better explain and console output - Key: PIG-605 URL: https://issues.apache.org/jira/browse/PIG-605 Project: Pig Issue Type: Improvement Components: grunt Reporter: Yiping Han It would be nice if, when we explain the script, the corresponding mapred jobs could be explicitly marked out in a neat way. While we execute the script, the console output could print the name and URL of the corresponding Hadoop jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-851) Map type used as return type in UDFs not recognized at all times
Map type used as return type in UDFs not recognized at all times Key: PIG-851 URL: https://issues.apache.org/jira/browse/PIG-851 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Fix For: 0.3.0 When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in run time failure. An example script and UDF follow:

{code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {

    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    // Note that the outputSchema method is commented out
    /*
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
    */
}
{code}

{code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
2009-06-15 17:59:01,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-851) Map type used as return type in UDFs not recognized at all times
[ https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719886#action_12719886 ] Santhosh Srinivasan commented on PIG-851: - A workaround for this issue is to override the outputSchema method and return the appropriate schema. Map type used as return type in UDFs not recognized at all times Key: PIG-851 URL: https://issues.apache.org/jira/browse/PIG-851 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Fix For: 0.3.0 When a UDF returns a map and the outputSchema method is not overridden, Pig does not figure out the data type. As a result, the type is set to unknown, resulting in run time failure. An example script and UDF follow:

{code}
public class mapUDF extends EvalFunc<Map<Object, Object>> {

    @Override
    public Map<Object, Object> exec(Tuple input) throws IOException {
        return new HashMap<Object, Object>();
    }

    // Note that the outputSchema method is commented out
    /*
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
        } catch (FrontendException e) {
            return null;
        }
    }
    */
}
{code}

{code}
grunt> a = load 'student_tab.data';
grunt> b = foreach a generate EXPLODE(1);
grunt> describe b;
b: {Unknown}
grunt> dump b;
2009-06-15 17:59:01,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
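The workaround can be illustrated with a self-contained sketch. The types below are stand-ins for the relevant slice of Pig's UDF API (the real EvalFunc, Schema, and DataType are not on the classpath here); the point is the shape of the fix: override outputSchema so the frontend sees MAP instead of Unknown:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types mimicking the relevant part of Pig's UDF API.
enum DataType { UNKNOWN, MAP }

class Schema {
    final DataType type;
    Schema(DataType type) { this.type = type; }
}

abstract class EvalFuncSketch<T> {
    public abstract T exec(Object input);
    // Mirrors the bug: with no override, the declared output type stays Unknown.
    public Schema outputSchema(Schema input) { return new Schema(DataType.UNKNOWN); }
}

// The workaround from the comment: override outputSchema and return MAP.
class MapUDF extends EvalFuncSketch<Map<Object, Object>> {
    @Override
    public Map<Object, Object> exec(Object input) {
        return new HashMap<Object, Object>();
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(DataType.MAP);
    }
}
```

With the override in place, `describe` would report a map type rather than Unknown, and the foreach no longer fails at runtime.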
[jira] Created: (PIG-842) PigStorage should support multi-byte delimiters
PigStorage should support multi-byte delimiters --- Key: PIG-842 URL: https://issues.apache.org/jira/browse/PIG-842 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Santhosh Srinivasan Fix For: 0.3.0 Currently, PigStorage supports single-byte delimiters. Users have requested multi-byte delimiters. There are performance implications with multi-byte delimiters: instead of looking for a single byte, PigStorage would have to look for a pattern, a la BinStorage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
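The pattern scan alluded to above might look like the following hypothetical helper (not PigStorage's actual code); it also shows where the extra cost comes from, since each buffer position may require up to delim.length comparisons instead of one:

```java
// Hypothetical helper: locate a multi-byte delimiter in a byte buffer.
// A single-byte delimiter needs one comparison per position; a multi-byte
// delimiter needs up to delim.length comparisons per position.
class DelimiterScanSketch {
    static int indexOfDelimiter(byte[] buf, byte[] delim) {
        if (delim.length == 0 || buf.length < delim.length) {
            return -1;
        }
        outer:
        for (int i = 0; i <= buf.length - delim.length; i++) {
            for (int j = 0; j < delim.length; j++) {
                if (buf[i + j] != delim[j]) {
                    continue outer; // mismatch, try the next offset
                }
            }
            return i; // full pattern matched at offset i
        }
        return -1;
    }
}
```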
[jira] Commented: (PIG-839) incorrect return codes on failure when using -f or -e flags
[ https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717451#action_12717451 ] Santhosh Srinivasan commented on PIG-839: - There are no unit test cases as the testing was performed manually. Pasting a test run below.

{code}
$ cat /errcode.pig
a = load '/user/sms/data/student_tab.data';
b = stream a through `false`;
store b into '/user/sms/data/errcode.out';

# Before fix
$ java -cp pig.jar:/home/y/conf/pig/piglet/released org.apache.pig.Main -f errcode.pig
2009-06-08 14:40:51,917 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-08 14:40:51,926 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2055: Received Error while processing the map plan: 'false ' failed with exit status: 1
Details at logfile: pig_1244497222536.log
$ echo $?
0

# After fix
$ java -cp pig.jar:/home/y/conf/pig/piglet/released org.apache.pig.Main -f /homes/sms/src_pig/pig/trunk_optimizer_phase3_part2/errcode.pig
2009-06-08 14:42:20,422 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2009-06-08 14:42:20,434 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2055: Received Error while processing the map plan: 'false ' failed with exit status: 1
Details at logfile: /homes/sms/src_commit/pig/trunk/pig_1244497306578.log
$ echo $?
2
{code}

incorrect return codes on failure when using -f or -e flags --- Key: PIG-839 URL: https://issues.apache.org/jira/browse/PIG-839 Project: Pig Issue Type: Bug Reporter: Gunther Hagleitner Assignee: Gunther Hagleitner Attachments: fix_return_code.patch To repro: pig -e "a = load 'some file'; b = stream a through \`false\`; store b into 'some file';" Both the -e and -f flags do not return the right code upon exit. Running the script w/o using -f works fine.
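The fix demonstrated by the test run amounts to propagating backend failure into the process exit status. A minimal, hypothetical sketch of that mapping follows (the real patch lives in Pig's entry point; the value 2 is the post-fix `echo $?` seen above, not a claim about Pig's full exit-code scheme):

```java
// Hypothetical sketch: map the outcome of a batch run (-f / -e) to a shell
// exit status, instead of unconditionally returning 0 as before the fix.
class ExitCodeSketch {
    static int exitStatus(int failedJobs) {
        // Nonzero on any failure so that `echo $?` reflects it.
        return failedJobs == 0 ? 0 : 2;
    }
}
```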
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported
[ https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717508#action_12717508 ] Santhosh Srinivasan commented on PIG-773: - You can ignore that. Empty complex constants (empty bag, empty tuple and empty map) should be supported -- Key: PIG-773 URL: https://issues.apache.org/jira/browse/PIG-773 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Priority: Minor Attachments: pig-773.patch We should be able to create an empty bag constant using {}, an empty tuple constant using (), and an empty map constant using [] within a Pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715143#action_12715143 ] Santhosh Srinivasan commented on PIG-697: - The graph operation pushAfter was added as a complementary operation to pushBefore. Currently, on the logical side, there are no concrete use cases for pushAfter. The only operator that truly supports multiple outputs is split. Our current model for split is to have a no-op split operator that has multiple successors (split outputs), each of which is the equivalent of a filter. The split outputs have inner plans which could have projection operators that hold references to the split's predecessor. When an operator is pushed after split, the operator will be placed between the split and the split output. As a result, when rewire on split is called, the call is dispatched to the split output. The references in the split output after the rewire will now point to split's predecessor instead of pointing to the operator that was pushed after. The intention of pushAfter in the case of a split is to push the operator after the split output. However, the generic pushAfter operation does not distinguish between split and split output. A possible way out is to override this method in the logical plan, duplicating most of the code in OperatorPlan and adding new code to handle split. As of now, pushAfter will not be used in the logical layer. Proposed improvements to pig's optimizer Key: PIG-697 URL: https://issues.apache.org/jira/browse/PIG-697 Project: Pig Issue Type: Bug Components: impl Reporter: Alan Gates Assignee: Santhosh Srinivasan Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, OptimizerPhase3_parrt1.patch I propose the following changes to Pig's optimizer, plan, and operator functionality to support more robust optimization: 1) Remove the required array from Rule.
This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group, Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them. 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern. Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                // It matches the pattern. Now check if the transformer
                // approves as well.
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    sawMatch = true;
                    rule.transformer.transform(match);
                }
            }
        }
        // Not sure if 1000 is the right number of iterations, maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules