[jira] Updated: (PIG-537) Failure in Hadoop map collect stage due to type mismatch in the keys used in cogroup
[ https://issues.apache.org/jira/browse/PIG-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-537: --- Status: Patch Available (was: Open) The issue was in Implicit Split inserter. In this query, the same load provides input to two cogroups. Hence an implicit split needs to be introduced. However the ImplicitSplitInserter was changing the order of the inputs to the first cogroup as it was rewiring the plan with the new Split and SplitOutput operators. The patch changes the algorithm for introducing these new operators so that the order of the inputs for the successors of the load is maintained. Failure in Hadoop map collect stage due to type mismatch in the keys used in cogroup Key: PIG-537 URL: https://issues.apache.org/jira/browse/PIG-537 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Viraj Bhat Assignee: Pradeep Kamath Priority: Critical Fix For: types_branch Attachments: explain_aliasC.log, mygrades.txt, mymarks.txt Consider the following pig query, which demonstrates various problems during the Logical Plan creation and the subsequent execution of the M/R job. In this query we do two cogroups, one between A and B to generate an alias ABtemptable. Then we again cogroup A with ABtemptable based on marks which was read in as an int. 
== {code} A = load 'mymarks.txt' as (marks:int, username:chararray); B = load 'mygrades.txt' as (username:chararray,grade:chararray); ABtemp = cogroup A by username, B by username; ABtemptable = foreach ABtemp generate group as username, flatten(A.marks) as newmarks; --describe ABtemptable; C = cogroup A by marks, ABtemptable by newmarks; --describe C; explain C; dump C; {code} == The schema for C and ABtemptable which pig reports: == {code}describe ABtemptable;{code} ABtemptable: {username: chararray,newmarks: int} {code}describe C;{code} C: {group: int,A: {username: chararray,marks: int},ABtemptable: {username: chararray,newmarks: int}} == If you run the above query you get the following error: == 2008-11-18 03:57:14,372 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) task_200810152105_0156_m_00java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableIntWritable at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:97) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:172) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:158) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:82) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209) == Looking at the {code}explain C;{code} output, you see that newmarks has become a chararray (surprising!!) 
== ---CoGroup viraj-Tue Nov 18 03:49:42 UTC 2008-25 Schema: {group: Unknown,{username: bytearray,marks: int},ABtemptable: {username: chararray,newmarks: chararray}} Type: bag Project viraj-Tue Nov 18 03:49:42 UTC 2008-23 Projections: [1] Overloaded: false FieldSchema: marks: int Type: int Input: SplitOutput[null] viraj-Tue Nov 18 03:49:42 UTC 2008-29 Project viraj-Tue Nov 18 03:49:42 UTC 2008-24 Projections: [1] Overloaded: false FieldSchema: newmarks: chararray Type: chararray Input: ForEach viraj-Tue Nov 18 03:49:42 UTC 2008-22 ---ForEach viraj-Tue Nov 18 03:49:42 UTC 2008-22 Schema: {username: chararray,newmarks: chararray} Type: bag == In Summary this script demonstrates the following problems: 1) Logical Plan
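The fix described above (keeping the order of a load's successors intact when the Split and SplitOutput operators are rewired in) can be illustrated with a small plain-Java sketch. The class and method names here are hypothetical and much simplified relative to Pig's ImplicitSplitInserter; the point is only that the new operator must be substituted into the *same slot* of each consumer's input list, never appended.

```java
import java.util.*;

// Hypothetical sketch, not Pig's actual classes: when a Split is inserted
// between a shared Load and its consumers, each consumer's ordered input list
// must keep its original positions, otherwise a cogroup ends up matching keys
// against the wrong relation (the bug seen in this issue).
class PlanRewire {
    // inputs: a consumer's ordered input list; replace oldIn with newIn in place
    static List<String> rewire(List<String> inputs, String oldIn, String newIn) {
        List<String> out = new ArrayList<>(inputs);
        for (int i = 0; i < out.size(); i++) {
            if (out.get(i).equals(oldIn)) {
                out.set(i, newIn);   // same slot: input order is preserved
            }
        }
        return out;
    }
}
```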
[jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
[ https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-563: --- Attachment: PIG-563.patch PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query -- Key: PIG-563 URL: https://issues.apache.org/jira/browse/PIG-563 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-563.patch Currently Pig's use of the combiner assumes the combiner is called exactly once in Hadoop. With Hadoop 18, the combiner could be called 0, 1 or more times. This issue is to track changes needed in the CombinerOptimizer visitor and the builtin Algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to be able to work in this new model. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-573) Changes to make Pig run with Hadoop 19
[ https://issues.apache.org/jira/browse/PIG-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-573: --- Attachment: hadoop19.jar PIG-573-combinerflag.patch PIG-573.patch PIG-573.patch contains the main changes for hadoop19. Essentially the changes were to reflect the renaming of org.apache.hadoop.dfs as org.apache.hadoop.hdfs, use of correct api in case of deprecated ones and change in build.xml to use hadoop19.jar. Caveats: 1) http://issues.apache.org/jira/browse/PIG-563 has a change in JobControlCompiler to remove the use of jobConf.setCombineOnceOnly(true). So if PIG-573.patch needs to be used BEFORE PIG-563 is committed, PIG-573-combinerflag.patch should also be used to get this change. 2) PIG-573 depends on a hadoop19.jar being present in lib dir. I have attached a hadoop19.jar for use. 3) Changes are needed in the combiner to work with this patch - those changes are in http://issues.apache.org/jira/browse/PIG-563 Changes to make Pig run with Hadoop 19 -- Key: PIG-573 URL: https://issues.apache.org/jira/browse/PIG-573 Project: Pig Issue Type: Task Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: hadoop19.jar, PIG-573-combinerflag.patch, PIG-573.patch This issue tracks changes to Pig code to make it work with Hadoop-0.19.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-573) Changes to make Pig run with Hadoop 19
[ https://issues.apache.org/jira/browse/PIG-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12658662#action_12658662 ] Pradeep Kamath commented on PIG-573: Hbase code in pig doesn't work with hadoop 19 and the files submitted thus far do not address this. A hbase-0.19.0.jar and hbase-0.19.0-test.jar are required and possibly other changes may be required. Changes to make Pig run with Hadoop 19 -- Key: PIG-573 URL: https://issues.apache.org/jira/browse/PIG-573 Project: Pig Issue Type: Task Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: hadoop19.jar, PIG-573-combinerflag.patch, PIG-573.patch This issue tracks changes to Pig code to make it work with Hadoop-0.19.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query
[ https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-563: --- Attachment: PIG-563-v3.patch COUNT.Initial was implemented that way so that if it is called in the non-combiner case in the reduce, it would still produce the right result. However, since we currently plan to call COUNT.Initial only when the combine plan is also present, we can be guaranteed it is called only in the map - so I have changed it to emit 1 in the new version in the attachment, as suggested in the review comment. PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query -- Key: PIG-563 URL: https://issues.apache.org/jira/browse/PIG-563 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-563-v2.patch, PIG-563-v3.patch, PIG-563.patch Currently Pig's use of the combiner assumes the combiner is called exactly once in Hadoop. With Hadoop 18, the combiner could be called 0, 1 or more times. This issue is to track changes needed in the CombinerOptimizer visitor and the builtin Algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to be able to work in this new model. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
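The three-stage contract discussed above can be sketched in plain Java. This is a hypothetical, simplified model (not Pig's real COUNT/Algebraic classes): Initial emits 1 per input tuple in the map, Intermed sums partial counts and may run zero or more times in the combiner, and Final sums whatever partials reach the reduce. Because summation is associative, the result is the same no matter how many combine passes Hadoop runs.

```java
import java.util.*;

// Hypothetical sketch of an algebraic COUNT that tolerates 0..n combiner passes.
class AlgebraicCount {
    static long initial() { return 1L; }         // one input tuple seen in the map
    static long intermed(List<Long> partials) {  // combiner: may run 0, 1, or more times
        long s = 0;
        for (long p : partials) s += p;
        return s;
    }
    static long fin(List<Long> partials) {       // reduce side ("Final"; name shortened
        return intermed(partials);               // since 'final' is reserved in Java)
    }
}
```

With this shape, skipping the combiner entirely (feeding the map's 1s straight to Final) and running it once or twice all yield the same count.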
[jira] Updated: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
[ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-580: --- Attachment: PIG-580-v2.patch Attaching a new version (PIG-580-v2.patch) - the only difference from the earlier one is that I have removed a debug statement from test/org/apache/pig/test/Util.java. PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach Key: PIG-580 URL: https://issues.apache.org/jira/browse/PIG-580 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-580-v2.patch, PIG-580.patch Currently Pig uses the combiner only when there is a foreach following a group and the elements in the foreach generate have the following characteristics: 1) simple project of the group column 2) Algebraic UDF The above conditions exclude use of the combiner for distinct aggregates - the distinct operation itself is combinable (irrespective of whether it feeds an algebraic or non-algebraic UDF). So the following foreach should also be combinable:
{code}
..
b = group a by $0;
c = foreach b {
    x = distinct a;
    generate group, COUNT(x), SUM(x.$1);
}
{code}
The combiner optimizer should cause the distinct to be combined and the final combine output should feed the COUNT() and SUM() in the reduce. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
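Why distinct is combinable can be shown with a small Java sketch (hypothetical code, not Pig's CombinerOptimizer): a per-pass dedupe is safe because set union is associative and idempotent, so the reduce-side distinct sees the same final set whether the combiner ran zero, one, or many times; COUNT and SUM then run once, on the reduce side, over the already-deduped set.

```java
import java.util.*;

// Hypothetical combiner step for a distinct aggregate: drop duplicates seen so far.
// Applying it again to its own output (or to partially-deduped input) changes nothing.
class CombinableDistinct {
    static Set<Integer> combine(Collection<Integer> values) {
        return new TreeSet<>(values);   // sorted set: duplicates collapse
    }
}
```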
[jira] Updated: (PIG-581) Pig should enable an option to disable the use of combiner optimizer
[ https://issues.apache.org/jira/browse/PIG-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-581: --- Summary: Pig should enable an option to disable the use of combiner optimizer (was: Pig should enable an option to disable the use of optimizer) Pig should enable an option to disable the use of combiner optimizer Key: PIG-581 URL: https://issues.apache.org/jira/browse/PIG-581 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Fix For: types_branch There are some cases where a combiner optimization chosen by Pig may actually be slower than the non-optimized version. For example, the use of the combiner to address the issue reported in https://issues.apache.org/jira/browse/PIG-580 could result in slower execution IF the distinct on groups of values does not actually shrink those groups. This is, however, very data dependent, and the user may know beforehand that this might be the case and may wish to disable the use of the optimizer. Pig should enable an option to do so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-554: --- Attachment: PIG-554-v4.patch Changes in new patch (attached): 1) The HashMap now has (Tuple, List<Tuple>) entries to address the concern that a Bag would be worse space-wise than a List<Tuple>. BagFactory now has a method newDefaultBag(List<Tuple>) which will create a DefaultDataBag out of the List<Tuple> by taking ownership of the list and without copying the elements. This way, in POFRJoin.getNext() we can create a bag out of the List<Tuple> without much overhead. 2) Added back the unit test - TestFRJoin - the change in this file is to use Util.createInputFile() to create the input file for the tests on the minicluster DFS rather than the local file system. It uses Util.deleteFile() to delete the file before and after each test run. Fragment Replicate Join --- Key: PIG-554 URL: https://issues.apache.org/jira/browse/PIG-554 Project: Pig Issue Type: New Feature Affects Versions: types_branch Reporter: Shravan Matthur Narayanamurthy Assignee: Shravan Matthur Narayanamurthy Fix For: types_branch Attachments: frjofflat.patch, frjofflat1.patch, PIG-554-v3.patch, PIG-554-v4.patch Fragment Replicate Join (FRJ) is useful when we want a join between a huge table and a very small table (fitting-in-memory small) and the join doesn't expand the data by much. The idea is to distribute the processing of the huge file by fragmenting it and replicating the small file to all machines receiving a fragment of the huge file. Because of the availability of the entire small file, the join becomes a trivial task without needing any break in the pipeline. Exhaustive tests have been done to determine the improvement we get out of FRJ. Here are the details: http://wiki.apache.org/pig/PigFRJoin The patch makes changes to parts of the code where new operators are introduced. Currently, when a new operator is introduced, its alias is not set.
For schema computation I have modified this behaviour to set the alias of the new operator to that of its predecessor. The logical side of the patch mimics the cogroup behavior as join syntax closely resembles that of cogroup. Currently, this patch doesn't have support for joins other than inner joins. The rest of the code has been documented. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
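The fragment-replicate idea described above can be sketched in a few lines of plain Java. This is a hypothetical illustration (the names are not Pig's POFRJoin internals): the small relation is loaded into an in-memory map keyed on the join key, and each tuple of the streamed large-file fragment probes the map, so no shuffle or pipeline break is needed. Tuples are modeled as string arrays with the join key in slot 0.

```java
import java.util.*;

// Hypothetical fragment-replicate inner join: build on the small side, probe with the big side.
class FRJoinSketch {
    static List<String> join(List<String[]> big, List<String[]> small) {
        // replicated in-memory copy of the small relation, keyed on the join key
        Map<String, List<String[]>> replicated = new HashMap<>();
        for (String[] t : small) {
            replicated.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
        }
        List<String> out = new ArrayList<>();
        for (String[] t : big) {   // the streamed fragment of the huge file
            for (String[] s : replicated.getOrDefault(t[0], Collections.emptyList())) {
                out.add(t[0] + ":" + t[1] + "," + s[1]);   // inner join only
            }
        }
        return out;
    }
}
```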
[jira] Created: (PIG-628) PERFORMANCE: Misc. optimizations including optimization in Tuple serialization, set up of PigMapReduce and PigCombiner, accessing index in POLocalRearrange
PERFORMANCE: Misc. optimizations including optimization in Tuple serialization, set up of PigMapReduce PigCombiner, accessing index in POLocalRearrange - Key: PIG-628 URL: https://issues.apache.org/jira/browse/PIG-628 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch - Currently DefaultTuple.write() needlessly writes a marker for null/not null. This is already handled by PigNullableWritable for keys and NullableTuple for values. Nested null tuples inside a tuple are written out as nulls in DataReaderWriter.writeDatum. So the null/not null marker in DefaultTuple can be avoided. - In PigMapReduce and PigCombiner the roots and leaves of the plans are calculated in each reduce() call. Instead these can be computed in configure() one time. - In each call of POLocalRearrange.getNext(), a new lroutput tuple is created whose first position is filled with index, second with key and third with value - this can be optimized by having a tuple member in POLocalRearrange which is reused in each getNext() call. Further, the first position of this tuple can be pre-filled with the index in the setIndex() method of POLocalRearrange at script compile time. - In POCombinerPackage, the metadata data structures to figure out which parts of the value are present in the key can be set up in the setKeyInfo() method at compile time. This is because we currently use POCombinerPackage only with a group by. Hence we don't need to look up the metadata at run time based on input index since there will be only one input (index = 0) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
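The POLocalRearrange reuse described above can be sketched as follows. This is a hypothetical, simplified stand-in for the real operator: the 3-slot output is allocated once, the index slot is pre-filled at compile time (in what the issue calls setIndex()), and each getNext() only overwrites the key and value slots instead of allocating a new tuple per record.

```java
// Hypothetical sketch of the output-tuple reuse pattern (not the real POLocalRearrange).
class LocalRearrangeSketch {
    private final Object[] lrOutput = new Object[3];   // allocated once, reused per record

    LocalRearrangeSketch(int index) {
        lrOutput[0] = index;   // pre-filled once, at "compile time"
    }

    Object[] getNext(Object key, Object value) {
        lrOutput[1] = key;     // only these two slots change per call
        lrOutput[2] = value;
        return lrOutput;       // same reference every call: no per-record allocation
    }
}
```

The same one-time-setup idea applies to computing plan roots/leaves in configure() and to POCombinerPackage's key metadata in setKeyInfo().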
[jira] Updated: (PIG-628) PERFORMANCE: Misc. optimizations including optimization in Tuple serialization, set up of PigMapReduce and PigCombiner, accessing index in POLocalRearrange
[ https://issues.apache.org/jira/browse/PIG-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-628: --- Attachment: PIG-628.patch Attached patch which implements the changes described in the issue description. PERFORMANCE: Misc. optimizations including optimization in Tuple serialization, set up of PigMapReduce PigCombiner, accessing index in POLocalRearrange - Key: PIG-628 URL: https://issues.apache.org/jira/browse/PIG-628 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-628.patch - Currently DefaultTuple.write() needlessly writes a marker for null/not null. This is already handled by PigNullableWritable for keys and NullableTuple for values. Nested null tuples inside a tuple are written out as nulls in DataReaderWriter.writeDatum. So the null/not null marker in DefaultTuple can be avoided. - In PigMapReduce and PigCombiner the roots and leaves of the plans are calculated in each reduce() call. Instead these can be computed in configure() one time. - In each call of POLocalRearrange.getNext(), a new lroutput tuple is created whose first position is filled with index, second with key and third with value - this can be optimized by having a tuple member in POLocalRearrange which is reused in each getNext() call. Further, the first position of this tuple can be pre-filled with the index in the setIndex() method of POLocalRearrange at script compile time. - In POCombinerPackage, the metadata data structures to figure out which parts of the value are present in the key can be set up in the setKeyInfo() method at compile time. This is because we currently use POCombinerPackage only with a group by. Hence we don't need to look up the metadata at run time based on input index since there will be only one input (index = 0) -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-634) When POUnion is one of the roots of a map plan, POUnion.getNext() gives a null pointer exception
[ https://issues.apache.org/jira/browse/PIG-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-634: --- Resolution: Fixed Status: Resolved (was: Patch Available) When POUnion is one of the roots of a map plan, POUnion.getNext() gives a null pointer exception Key: PIG-634 URL: https://issues.apache.org/jira/browse/PIG-634 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-634.patch POUnion.getNext() gives a null pointer exception in the following scenario (pasted from a code comment explaining the fix for this issue). If a script results in a plan like the one below, currently POUnion.getNext() gives a null pointer exception:
{noformat}
// POUnion
// |
// |--POLocalRearrange
// |  |
// |  |--POUnion (root 2) -- This union's getNext() can lead the code here
// |
// |--POLocalRearrange (root 1)
//
// The inner POUnion above is a root in the plan which has 2 roots.
// So these 2 roots would have input coming from different input
// sources (dfs files). So certain maps would be working on input only
// meant for root 1 above and some maps would work on input
// meant only for root 2. In the former case, root 2 would
// neither get input attached to it nor does it have predecessors
{noformat}
A script which can cause a plan like the above is:
{code}
a = load 'xyz';
b = load 'abc';
c = union a, b;
d = load 'def';
e = cogroup c by $0 inner, d by $0 inner;
dump e;
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
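The guard the fix implies can be sketched in plain Java (hypothetical, heavily simplified relative to POUnion): a union that is a root of the plan may be scheduled in a map whose input split belongs to a *different* root, so when it has no attached input and no predecessors it must report end-of-processing instead of dereferencing a null input.

```java
// Hypothetical sketch of a root operator guarding against "no input for me in this map".
class UnionRootSketch {
    Object attachedInput;   // set by the map only when this root owns the current split
    java.util.List<Object> predecessors = java.util.Collections.emptyList();

    String getNext() {
        if (attachedInput == null && predecessors.isEmpty()) {
            return "EOP";   // end of processing for this root, not an NPE
        }
        return "RESULT";    // normal path: consume attached input / predecessors
    }
}
```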
[jira] Commented: (PIG-634) When POUnion is one of the roots of a map plan, POUnion.getNext() gives a null pointer exception
[ https://issues.apache.org/jira/browse/PIG-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12667439#action_12667439 ] Pradeep Kamath commented on PIG-634: Patch committed. When POUnion is one of the roots of a map plan, POUnion.getNext() gives a null pointer exception Key: PIG-634 URL: https://issues.apache.org/jira/browse/PIG-634 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-634.patch POUnion.getNext() gives a null pointer exception in the following scenario (pasted from a code comment explaining the fix for this issue). If a script results in a plan like the one below, currently POUnion.getNext() gives a null pointer exception:
{noformat}
// POUnion
// |
// |--POLocalRearrange
// |  |
// |  |--POUnion (root 2) -- This union's getNext() can lead the code here
// |
// |--POLocalRearrange (root 1)
//
// The inner POUnion above is a root in the plan which has 2 roots.
// So these 2 roots would have input coming from different input
// sources (dfs files). So certain maps would be working on input only
// meant for root 1 above and some maps would work on input
// meant only for root 2. In the former case, root 2 would
// neither get input attached to it nor does it have predecessors
{noformat}
A script which can cause a plan like the above is:
{code}
a = load 'xyz';
b = load 'abc';
c = union a, b;
d = load 'def';
e = cogroup c by $0 inner, d by $0 inner;
dump e;
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner
PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner Key: PIG-636 URL: https://issues.apache.org/jira/browse/PIG-636 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Currently, whenever the Combiner is used in Pig, in the map, the POPrecombinerLocalRearrange operator puts the single value tuple corresponding to a key into a DataBag and passes this to the foreach which is being combined. This will generate as many bags as there are input records. These bags all have a single tuple and hence are small and should not need to be spilled to disk. However, since the bags are created through the BagFactory mechanism, each bag creation is registered with the SpillableMemoryManager and a weak reference to the bag is stored in a linked list. This linked list grows really big over time, causing unnecessary garbage-collection runs. This can be avoided by having a simple lightweight implementation of the DataBag interface to store the single tuple in a bag. Also, these SingleTupleBags should be created without registering with the SpillableMemoryManager. Likewise, the bags created in POCombinePackage are supposed to fit in memory and not spill. Again, a NonSpillableDataBag implementation of the DataBag interface which does not register with the SpillableMemoryManager would help. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
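A minimal sketch of the single-tuple bag idea (hypothetical and simplified relative to Pig's actual SingleTupleBag and the DataBag interface): it holds exactly one tuple, can never spill, and is constructed directly rather than through BagFactory, so nothing is registered with the SpillableMemoryManager and no weak reference accumulates per input record. A tuple is modeled here as a plain `List<Object>`.

```java
import java.util.*;

// Hypothetical lightweight one-tuple bag: fixed size, no spill path, no factory/manager.
class SingleTupleBagSketch implements Iterable<List<Object>> {
    private final List<Object> tuple;

    SingleTupleBagSketch(List<Object> tuple) {
        this.tuple = tuple;   // direct construction: no SpillableMemoryManager registration
    }

    public long size() { return 1; }

    public Iterator<List<Object>> iterator() {
        return Collections.singletonList(tuple).iterator();
    }
}
```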
[jira] Updated: (PIG-645) Streaming is broken with the latest trunk
[ https://issues.apache.org/jira/browse/PIG-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-645: --- Attachment: PIG-645.patch Streaming is broken with the latest trunk - Key: PIG-645 URL: https://issues.apache.org/jira/browse/PIG-645 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Olga Natkovich Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-645.patch Several tests we run are failing now -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-645) Streaming is broken with the latest trunk
[ https://issues.apache.org/jira/browse/PIG-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-645: --- Fix Version/s: types_branch Affects Version/s: types_branch Status: Patch Available (was: Open) Attached patch to fix the issues with streaming. The root cause of the issue was that the changes introduced by PIG-629 (PERFORMANCE: Eliminate use of TargetedTuple for each input tuple in the map()) caused a race condition on the input tuple in the map() between recordReader.next(Tuple value) and the streaming binary. BEFORE PIG-629, the flow of a tuple from record reader to map was as follows: The recordReader instance gets the *same* TargetedTuple object reference in every next(TargetedTuple value) call (this is because Hadoop reuses the value object for each recordReader.next(value) call). The recordReader.next(value) call in turn calls PigSlice.next(Tuple value) which has the following implementation:
{code}
public boolean next(Tuple value) throws IOException {
    Tuple t = loader.getNext();
    if (t == null) {
        return false;
    }
    value.reference(t);
    return true;
}
{code}
Here value.reference(t) calls the TargetedTuple.reference(Tuple) method, which simply stores the supplied tuple in its member Tuple variable t. In PigMapBase.map(), the toTuple() method on the input TargetedTuple is called, which returns the above stored tuple reference t. This reference is then attached to the roots of the map plan. The point to note is that this final tuple reference which is used by the operators in the map plan is the reference to the tuple returned from the loader and not the reference to the TargetedTuple which we get from the recordReader and which is supplied as an argument to the map() call. The loader creates a new tuple reference on each getNext().
This guarantees that the operators in the map plan always work with a different tuple reference on each map() call, even though the TargetedTuple reference supplied in the map() is the same and reused by Hadoop. AFTER PIG-629, the flow changed as follows: TargetedTuple was removed and Tuple was used instead. The PigSlice.next(Tuple value) code remained intact. However, the DefaultTuple.reference(Tuple) call in it assigns the internal mFields arraylist to the arraylist of the supplied tuple. Note that here the internal member arraylist of the DefaultTuple is changed to refer to the internal arraylist of the Tuple the loader gives. In map(), the tuple which is supplied as the input argument to the map() call is attached directly to the roots. So in the case of streaming, this tuple is finally supplied to the binary by using a storage function (PigStorage by default). However, this tuple reference is the same as the one which gets reused by Hadoop in the next recordReader.next(value) call. So while the storage function is in the process of writing the current Tuple's contents (the mFields arraylist), it can get changed underneath due to the recordReader.next(value) call. So unless the storage function writes to the binary's stdin BEFORE the next recordReader.next(value) call, the input sent to the binary will be garbled. The fix is the following one-line change:
{noformat}
for (PhysicalOperator root : roots) {
-    root.attachInput(inpTuple);
+    root.attachInput(tf.newTupleNoCopy(inpTuple.getAll()));
}
{noformat}
In map(), instead of attaching the inpTuple directly to the roots of the plan, a new Tuple is created which refers to the same mFields arrayList as inpTuple. With this change, all operators in the map plan now work on a different Tuple reference from the one which is supplied in the map() argument (and which is reused by Hadoop).
This reference will refer to the mFields of the Tuple returned from the loader which is guaranteed to be a new arraylist for each input tuple since the loader creates a new Tuple each time. Streaming is broken with the latest trunk - Key: PIG-645 URL: https://issues.apache.org/jira/browse/PIG-645 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Olga Natkovich Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-645.patch Several tests we run are failing now -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
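The aliasing hazard described above can be reproduced in a few lines of plain Java (hypothetical classes, much simplified from DefaultTuple and TupleFactory): Hadoop-style reuse swaps the reused holder's internal field list on every next() call, but a fresh holder created per record keeps pointing at the field list it was given, so a slow consumer (the streaming binary here) is unaffected by the next read.

```java
import java.util.*;

// Hypothetical stand-in for the reused map-input tuple: Hadoop calls reference()
// on the same holder object for every record, swapping its internal list.
class ReusedTuple {
    List<Object> mFields;
    void reference(List<Object> fields) { mFields = fields; }
}

// Hypothetical analogue of tf.newTupleNoCopy(inpTuple.getAll()): a fresh holder
// per record that shares (does not copy) the current field list.
class FreshTuple {
    final List<Object> mFields;
    FreshTuple(List<Object> fields) { this.mFields = fields; }
}
```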
[jira] Created: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()
RandomSampleLoader does not handle skipping correctly in getNext() -- Key: PIG-649 URL: https://issues.apache.org/jira/browse/PIG-649 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Currently RandomSampleLoader calls skip() on the underlying input stream (BufferedPositionedInputStream) in its getNext(). The input stream may not actually skip over the amount the RandomSampleLoader needs in one call. RandomSampleLoader should check the return value from the skip() call and ensure that skip() is called repeatedly (if necessary) till the needed number of bytes are skipped. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
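The skip-in-a-loop pattern the issue calls for can be sketched as follows. This is a simplified, self-contained illustration (the real change lives inside RandomSampleLoader.getNext() against BufferedPositionedInputStream): InputStream.skip(n) may legally skip fewer than n bytes, so the return value must be checked and skip() repeated until the requested count is reached or the stream ends.

```java
import java.io.IOException;
import java.io.InputStream;

class SkipFully {
    // Skip up to n bytes, looping because skip() may make partial progress.
    // Returns the number of bytes actually skipped (less than n only at EOF).
    static long skipFully(InputStream in, long n) {
        long remaining = n;
        try {
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped > 0) {
                    remaining -= skipped;
                } else if (in.read() >= 0) {
                    remaining--;          // skip() made no progress; consume one byte
                } else {
                    break;                // end of stream
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // simplified for the sketch
        }
        return n - remaining;
    }
}
```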
[jira] Updated: (PIG-651) PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens
[ https://issues.apache.org/jira/browse/PIG-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-651: --- Status: Patch Available (was: Open) Attached patch implements a simpler POForEachNoFlatten whenever no flattens are present in the POForEach (this is determined in LogToPhyTranslator and is also used in CombinerOptimizer for the map and combine plan foreachs). Initial tests show a marginal speedup (10 seconds out of 9 mins 50 secs in one particular group by query). PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens --- Key: PIG-651 URL: https://issues.apache.org/jira/browse/PIG-651 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-651.patch POForEach has a lot of code to handle flattening (cross product) of the fields in the generate. This is relevant only when at least one field in the generate needs to be flattened. If no fields in the generate need to be flattened, a more simplified and hopefully more efficient POForEach can be used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()
[ https://issues.apache.org/jira/browse/PIG-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-649: --- Status: Patch Available (was: Open) The attached patch fixes the issue by keeping track of the return value of the underlying input stream's skip(). If enough bytes are not skipped on the initial call, multiple calls are made till enough bytes are skipped. RandomSampleLoader does not handle skipping correctly in getNext() -- Key: PIG-649 URL: https://issues.apache.org/jira/browse/PIG-649 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-649.patch Currently RandomSampleLoader calls skip() on the underlying input stream (BufferedPositionedInputStream) in its getNext(). The input stream may not actually skip over the amount the RandomSampleLoader needs in one call. RandomSampleLoader should check the return value from the skip() call and ensure that skip() is called repeatedly (if necessary) till the needed number of bytes are skipped. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()
[ https://issues.apache.org/jira/browse/PIG-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-649: --- Attachment: PIG-649.patch RandomSampleLoader does not handle skipping correctly in getNext() -- Key: PIG-649 URL: https://issues.apache.org/jira/browse/PIG-649 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-649.patch Currently RandomSampleLoader calls skip() on the underlying input stream (BufferedPositionedInputStream) in its getNext(). The input stream may not actually skip over the amount the RandomSampleLoader needs in one call. RandomSampleLoader should check the return value from the skip() call and ensure that skip() is called repeatedly (if necessary) till the needed number of bytes are skipped. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-648) BinStorage fails when it finds markers unexpectedly in the data
[ https://issues.apache.org/jira/browse/PIG-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-648: --- Status: Patch Available (was: Open) Attached patch which now uses Ctrl-A, Ctrl-B, Ctrl-C (0x01, 0x02, 0x03) as the record markers. BinStorage fails when it finds markers unexpectedly in the data --- Key: PIG-648 URL: https://issues.apache.org/jira/browse/PIG-648 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-648.patch The current record begin marker used in BinStorage is the consecutive sequence 0x21,0x31,0x41 - these bytes correspond to the ASCII characters !1A. This sequence is not very strong as a marker - this results in failures when the sequence occurs in the data - the markers should be control characters which have a high probability of not occurring in the data. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()
[ https://issues.apache.org/jira/browse/PIG-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-649: --- Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed RandomSampleLoader does not handle skipping correctly in getNext() -- Key: PIG-649 URL: https://issues.apache.org/jira/browse/PIG-649 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-649.patch Currently RandomSampleLoader calls skip() on the underlying input stream (BufferedPositionedInputStream) in its getNext(). The input stream may not actually skip over the amount the RandomSampleLoader needs in one call. RandomSampleLoader should check the return value from the skip() call and ensure that skip() is called repeatedly (if necessary) till the needed number of bytes are skipped. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-428) TypeCastInserter does not replace projects in inner plans correctly
[ https://issues.apache.org/jira/browse/PIG-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12669687#action_12669687 ] Pradeep Kamath commented on PIG-428: Have you tried your query with top of trunk? The original problem fixed in this issue occurred when the TypeCastInserter was involved in the query. That is the case only when the load statement has a schema like a = load 'bla' as (x:int, y:float);. In your query in the previous comment the load statement does not have a schema. I am wondering if the issue is somewhere else in that query but the error message is the same. TypeCastInserter does not replace projects in inner plans correctly --- Key: PIG-428 URL: https://issues.apache.org/jira/browse/PIG-428 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Fix For: types_branch Attachments: PIG-428.patch The TypeCastInserter tries to replace the Project's input operator in inner plans with the new foreach operator it adds. However it should replace only those Projects' input where the new Foreach has been added after the operator which was earlier the input to Project. 
Here is a query which fails due to this: {code} a = load 'st10k' as (name:chararray,age:int, gpa:double); another = load 'st10k'; c = foreach another generate $0, $1+ 10, $2 + 10; d = join a by $0, c by $0; dump d; {code} Here is the error: {noformat} 2008-09-11 23:34:28,169 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (map) tip_200809051428_0045_m_00java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableBytesWritable at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:419) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:83) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:172) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:158) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:75) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-653) Make fieldsToRead work in loader
[ https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-653: --- Attachment: PIG-653-2.comment A new proposal has been attached as a revision of the proposal in comment 1. The two main changes are: 1. A new class RequiredFieldList will be used to convey the list of required fields. A separate class was chosen here (rather than using a List<RequiredField> and a boolean separately) since it gives us the flexibility to extend it easily in the future. 2. The new type, BAG_OF_MAP, is no longer needed. So if a certain field is a bag (named bg) which contains a single column which is a map and we need the data for only one key (say k1) from it, we can represent that by having a RequiredField object of type BAG with alias bg. This object will have one RequiredField object in its subFields list which will be of type MAP and which will have index 0 to indicate this is the first subfield in the bag. This object in turn will have one RequiredField object in its subFields list which will be of type BYTEARRAY and which will have alias k1. This illustrates how subcolumns of interest can be represented by the RequiredField class. Make fieldsToRead work in loader Key: PIG-653 URL: https://issues.apache.org/jira/browse/PIG-653 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Pradeep Kamath Attachments: PIG-653-2.comment Currently pig does not call the fieldsToRead function in LoadFunc, thus it does not provide information to load functions on what fields are needed. We need to implement a visitor that determines (where possible) which fields in a file will be used and relays that information to the load function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
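The nested representation described in point 2 can be sketched as below. This is a hypothetical illustration built from the proposal in the comment, not committed Pig code: the class shape (alias, index, type, subFields) follows the comment's wording, and the type constant values are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class RequiredFieldExample {
    // Illustrative stand-ins for Pig's DataType constants (values are arbitrary here)
    static final byte BAG = 120, MAP = 100, BYTEARRAY = 50;

    // Hypothetical sketch of the proposed RequiredField class
    static class RequiredField {
        String alias;                 // column name, when referenced by name
        int index = -1;               // position within the parent, when referenced by position
        byte type;                    // one of the type constants above
        List<RequiredField> subFields = new ArrayList<RequiredField>();
    }

    // Build the example from the comment: bag 'bg' whose single map column is
    // needed only for key 'k1'.
    static RequiredField buildExample() {
        RequiredField bg = new RequiredField();
        bg.alias = "bg";
        bg.type = BAG;

        RequiredField mapCol = new RequiredField();
        mapCol.index = 0;             // first (and only) subfield in the bag
        mapCol.type = MAP;
        bg.subFields.add(mapCol);

        RequiredField k1 = new RequiredField();
        k1.alias = "k1";              // only this map key is required
        k1.type = BYTEARRAY;
        mapCol.subFields.add(k1);

        return bg;
    }

    public static void main(String[] args) {
        RequiredField bg = buildExample();
        System.out.println(bg.subFields.get(0).subFields.get(0).alias); // k1
    }
}
```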
[jira] Created: (PIG-665) Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
Map key type not correctly set (for use when key is null) when map plan does not have localrearrange Key: PIG-665 URL: https://issues.apache.org/jira/browse/PIG-665 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch KeyTypeDiscoveryVisitor visits the map plan to figure out the datatype of the map key. This is required so that when the map key is null, we can still construct a valid NullableXXXWritable object to pass on to hadoop in the collect() call (hadoop needs a valid object even for null objects). Currently the KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange to figure out the key type. In a pig script which results in multiple Map reduce jobs, one of the jobs could have a map plan with only POLoads in it. In such a case, the map key type is not discovered and this results in a null being returned from HDataType.getWritableComparableTypes() method. This in turn will result in a NullPointerException in the collect(). Here is a script which can prompt this behavior:
{code}
a = load 'a.txt' as (x:int, y:int, z:int);
b = load 'b.txt' as (x:int, y:int);
b_group = group b by x;
b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
a_group = group a by (x, y);
a_aggs = foreach a_group {
    generate flatten(group) as (x, y), SUM(a.z) as zs;
};
join_a_b = join b_sum by x, a_aggs by x;
-- the map plan for this join will only have two POLoads which will result in the NullPointerException at runtime in collect()
dump join_a_b;
{code}
Contents of a.txt (columns are tab separated); the first column of the first two rows is null (represented by an empty column):
{noformat}
	7	8
	8	9
1	20	30
1	20	40
{noformat}
Contents of b.txt (columns are tab separated):
{noformat}
7	2
1	5
1	10
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-665) Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
[ https://issues.apache.org/jira/browse/PIG-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-665: --- Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed. Map key type not correctly set (for use when key is null) when map plan does not have localrearrange Key: PIG-665 URL: https://issues.apache.org/jira/browse/PIG-665 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-665.patch KeyTypeDiscoveryVisitor visits the map plan to figure out the datatype of the map key. This is required so that when the map key is null, we can still construct a valid NullableXXXWritable object to pass on to hadoop in the collect() call (hadoop needs a valid object even for null objects). Currently the KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange to figure out the key type. In a pig script which results in multiple Map reduce jobs, one of the jobs could have a map plan with only POLoads in it. In such a case, the map key type is not discovered and this results in a null being returned from HDataType.getWritableComparableTypes() method. This in turn will result in a NullPointerException in the collect(). 
Here is a script which can prompt this behavior:
{code}
a = load 'a.txt' as (x:int, y:int, z:int);
b = load 'b.txt' as (x:int, y:int);
b_group = group b by x;
b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
a_group = group a by (x, y);
a_aggs = foreach a_group {
    generate flatten(group) as (x, y), SUM(a.z) as zs;
};
join_a_b = join b_sum by x, a_aggs by x;
-- the map plan for this join will only have two POLoads which will result in the NullPointerException at runtime in collect()
dump join_a_b;
{code}
Contents of a.txt (columns are tab separated); the first column of the first two rows is null (represented by an empty column):
{noformat}
	7	8
	8	9
1	20	30
1	20	40
{noformat}
Contents of b.txt (columns are tab separated):
{noformat}
7	2
1	5
1	10
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-545: --- Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed. PERFORMANCE: Sampler for order bys does not produce a good distribution --- Key: PIG-545 URL: https://issues.apache.org/jira/browse/PIG-545 Project: Pig Issue Type: Bug Components: impl Affects Versions: types_branch Reporter: Alan Gates Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-545-v3.patch, PIG-545-v4.patch, WRP.patch, WRP1.patch In running tests on actual data, I've noticed that the final reduce of an order by has skewed partitions. Some reduces finish in a few seconds while some run for 20 minutes. Getting a better distribution should lead to much better performance for order by. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-652) Need to give user control of OutputFormat
[ https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-652: --- Fix Version/s: types_branch Assignee: Pradeep Kamath (was: Alan Gates) Affects Version/s: types_branch Hadoop Flags: [Incompatible change] Status: Patch Available (was: Open) Submitting a patch with a few changes to the way this will work. Very soon we will have the ability to store multiple outputs in the map or reduce phase of a job (https://issues.apache.org/jira/browse/PIG-627). In that scenario the OutputFormat will still need to be able to get a handle on the corresponding StoreFunc, location and schema to use for the particular output that it is trying to write. To enable this, a utility class - MapRedUtil - is being introduced which has static methods that take a JobConf and return these pieces of information. When PIG-627 is implemented, these utility methods will hide the inner Pig implementation which maps the multiple stores to the corresponding StoreFunc, location and schema. The new method in StoreFunc proposed at the beginning of this issue will still be used to ask the StoreFunc if it will provide an OutputFormat implementation. Need to give user control of OutputFormat - Key: PIG-652 URL: https://issues.apache.org/jira/browse/PIG-652 Project: Pig Issue Type: New Feature Components: impl Affects Versions: types_branch Reporter: Alan Gates Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-652.patch Pig currently allows users some control over InputFormat via the Slicer and Slice interfaces. It does not allow any control over OutputFormat and RecordWriter interfaces. It just allows the user to implement a storage function that controls how the data is serialized. For hadoop tables, we will need to allow custom OutputFormats that prepare output information and objects needed by a Table store function. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-652) Need to give user control of OutputFormat
[ https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-652: --- Attachment: PIG-652.patch Need to give user control of OutputFormat - Key: PIG-652 URL: https://issues.apache.org/jira/browse/PIG-652 Project: Pig Issue Type: New Feature Components: impl Affects Versions: types_branch Reporter: Alan Gates Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-652.patch Pig currently allows users some control over InputFormat via the Slicer and Slice interfaces. It does not allow any control over OutputFormat and RecordWriter interfaces. It just allows the user to implement a storage function that controls how the data is serialized. For hadoop tables, we will need to allow custom OutputFormats that prepare output information and objects needed by a Table store function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-652) Need to give user control of OutputFormat
[ https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-652: --- Attachment: PIG-652-v2.patch Attached new version which addresses the comment regarding having a serialVersionUID in StoreConfig since it is Serializable. Also removed a redundant import from StoreFunc.java. Moving the string constants to a separate PropertyKeys.java file would be good, but to be useful this will need all existing free-standing constants to be moved there - this would be best done in a separate jira as suggested. Having schemas in all PhysicalOperators might be good but would need all the visitors (optimizers etc.) which run after LogToPhyTranslationVisitor and which introduce new PhysicalOperators to also generate schema for them. Likewise these visitors would have to handle schema changes on operators they modify - this might be better pursued in a different jira if found to be worthwhile. Need to give user control of OutputFormat - Key: PIG-652 URL: https://issues.apache.org/jira/browse/PIG-652 Project: Pig Issue Type: New Feature Components: impl Affects Versions: types_branch Reporter: Alan Gates Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-652-v2.patch, PIG-652.patch Pig currently allows users some control over InputFormat via the Slicer and Slice interfaces. It does not allow any control over OutputFormat and RecordWriter interfaces. It just allows the user to implement a storage function that controls how the data is serialized. For hadoop tables, we will need to allow custom OutputFormats that prepare output information and objects needed by a Table store function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
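As a side note on the serialVersionUID review comment above, a minimal sketch of what declaring it looks like. The field shown is a placeholder for illustration, not StoreConfig's actual contents.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// A Serializable class should declare serialVersionUID so that deserialization
// stays compatible across recompiles; without it, the JVM derives a hash that
// changes whenever the class shape changes.
public class StoreConfigSketch implements Serializable {
    private static final long serialVersionUID = 1L; // pins the stream version

    private final String location; // hypothetical field for illustration

    public StoreConfigSketch(String location) { this.location = location; }

    public String getLocation() { return location; }

    public static void main(String[] args) {
        // ObjectStreamClass reports the UID that will be written to the stream
        System.out.println(ObjectStreamClass.lookup(StoreConfigSketch.class).getSerialVersionUID()); // 1
    }
}
```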
[jira] Updated: (PIG-652) Need to give user control of OutputFormat
[ https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-652: --- Attachment: PIG-652-v3.patch Attached a new version of the patch. Changes include: 1) Included the MapRedUtil.java source file which was missing in the previous patch 2) Fixed a few issues which were uncovered in some tests 3) Regenerated the patch with the latest svn revision Need to give user control of OutputFormat - Key: PIG-652 URL: https://issues.apache.org/jira/browse/PIG-652 Project: Pig Issue Type: New Feature Components: impl Affects Versions: types_branch Reporter: Alan Gates Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652.patch Pig currently allows users some control over InputFormat via the Slicer and Slice interfaces. It does not allow any control over OutputFormat and RecordWriter interfaces. It just allows the user to implement a storage function that controls how the data is serialized. For hadoop tables, we will need to allow custom OutputFormats that prepare output information and objects needed by a Table store function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-591) Error handling phase four
[ https://issues.apache.org/jira/browse/PIG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676399#action_12676399 ] Pradeep Kamath commented on PIG-591: Code review comments: the patch looks good to go with the minor observations below:
- The System.err.println() message in PigHadoopLogger.warn() seems like a debug statement.
- In EvalFunc.progress() there is: log.warn("No reporter object provided to UDF " + this.getClass().getName()); Shouldn't this go through the PigLogger?
- If we want warning aggregation in UDFs, should the UDF writer create new entries in PigWarning? (If so, the UDF manual should probably outline this.)
- Is there a reason why initialized needs to be volatile in PigMapBase? There should be only one Map thread in the map() function. If there is a reason for it to be volatile, does it apply to PigMapReduce, PigCombiner and POUserFunc as well?
- In POUserFunc.instantiateFunc() should we still set the Reporter and PigLogger if the assignments don't actually work and we rely on processInput() for these initializations?
- In DefaultAbstractBag, warn() should mimic Utf8StorageConverter.
- GruntParser.java has only a whitespace change (the change should be reverted since earlier there were spaces and now there is a tab).
Error handling phase four - Key: PIG-591 URL: https://issues.apache.org/jira/browse/PIG-591 Project: Pig Issue Type: Sub-task Components: grunt, impl, tools Affects Versions: types_branch Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: types_branch Attachments: Error_handling_phase4.patch Phase four of the error handling feature will address the warning message cleanup and warning message aggregation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-591) Error handling phase four
[ https://issues.apache.org/jira/browse/PIG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-591. Resolution: Fixed Hadoop Flags: [Reviewed] Santhosh, thanks for the feature contribution. Patch committed with the following changes in HExecutionEngine.java:
{code}
@@ -200,7 +200,7 @@
         } catch (IOException e) {
             int errCode = 6009;
-            String msg = "Failed to create job client";
+            String msg = "Failed to create job client: " + e.getMessage();
             throw new ExecException(msg, errCode, PigException.BUG, e);
         }
     }
@@ -549,11 +549,20 @@
             //this should return as soon as connection is shutdown
             int rc = p.waitFor();
             if (rc != 0) {
-                String errMsg = new String();
+                StringBuilder errMsg = new StringBuilder();
                 try {
-                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
-                    errMsg = br.readLine();
+                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
+                    String line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
                     br.close();
+                    br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
+                    line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
+                    br.close();
                 } catch (IOException ioe) {}
                 int errCode = 6011;
                 StringBuilder msg = new StringBuilder("Failed to run command ");
@@ -563,7 +572,7 @@
                 msg.append("; return code: ");
                 msg.append(rc);
                 msg.append("; error: ");
-                msg.append(errMsg);
+                msg.append(errMsg.toString());
                 throw new ExecException(msg.toString(), errCode, PigException.REMOTE_ENVIRONMENT);
             }
         } catch (Exception e){
{code}
These extra changes are needed so that the right error message is shown when there is an error while connecting to DFS. Since this is the last error handling related patch it seemed logical to add this with this patch. The above change has been taken from the patch submitted for http://issues.apache.org/jira/browse/PIG-682. So when http://issues.apache.org/jira/browse/PIG-682 is finally committed this portion can be omitted. 
Error handling phase four - Key: PIG-591 URL: https://issues.apache.org/jira/browse/PIG-591 Project: Pig Issue Type: Sub-task Components: grunt, impl, tools Affects Versions: types_branch Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: types_branch Attachments: Error_handling_phase4.patch, Error_handling_phase4_1.patch Phase four of the error handling feature will address the warning message cleanup and warning message aggregation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-682) Fix the ssh tunneling code
[ https://issues.apache.org/jira/browse/PIG-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676810#action_12676810 ] Pradeep Kamath commented on PIG-682: As noted in https://issues.apache.org/jira/browse/PIG-591?focusedCommentId=12676808#action_12676808 a part of this patch has already been committed as part of https://issues.apache.org/jira/browse/PIG-591. The portion which has already been committed is in HExecutionEngine.java:
{code}
@@ -200,7 +200,7 @@
         } catch (IOException e) {
             int errCode = 6009;
-            String msg = "Failed to create job client";
+            String msg = "Failed to create job client: " + e.getMessage();
             throw new ExecException(msg, errCode, PigException.BUG, e);
         }
     }
@@ -549,11 +549,20 @@
             //this should return as soon as connection is shutdown
             int rc = p.waitFor();
             if (rc != 0) {
-                String errMsg = new String();
+                StringBuilder errMsg = new StringBuilder();
                 try {
-                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
-                    errMsg = br.readLine();
+                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
+                    String line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
                     br.close();
+                    br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
+                    line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
+                    br.close();
                 } catch (IOException ioe) {}
                 int errCode = 6011;
                 StringBuilder msg = new StringBuilder("Failed to run command ");
@@ -563,7 +572,7 @@
                 msg.append("; return code: ");
                 msg.append(rc);
                 msg.append("; error: ");
-                msg.append(errMsg);
+                msg.append(errMsg.toString());
                 throw new ExecException(msg.toString(), errCode, PigException.REMOTE_ENVIRONMENT);
             }
         } catch (Exception e){
{code}
When a new revision of this patch is generated to make the changes for the previous review comment, the above portion of code changes can be omitted. 
Fix the ssh tunneling code -- Key: PIG-682 URL: https://issues.apache.org/jira/browse/PIG-682 Project: Pig Issue Type: Bug Components: impl Reporter: Benjamin Reed Attachments: jsch-0.1.41.jar, PIG-682.patch Hadoop has changed a bit and the ssh-gateway code no longer works. Pig needs to be updated to register with the new socket framework. Reporting of problems also needs to be better. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-652) Need to give user control of OutputFormat
[ https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-652: --- Attachment: PIG-652-v4.patch Attaching new patch - the only difference is (old code is the first input in the diff below):
{code}
--- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java (revision 747112)
+++ src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java (revision 748740)
146c146
< if(sPrepClass != null && sPrepClass.isInstance(OutputFormat.class)) {
---
> if(sPrepClass != null && OutputFormat.class.isAssignableFrom(sPrepClass)) {
{code}
The change checks whether the class supplied by the Loader is of type OutputFormat. Need to give user control of OutputFormat - Key: PIG-652 URL: https://issues.apache.org/jira/browse/PIG-652 Project: Pig Issue Type: New Feature Components: impl Affects Versions: types_branch Reporter: Alan Gates Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652-v4.patch, PIG-652.patch Pig currently allows users some control over InputFormat via the Slicer and Slice interfaces. It does not allow any control over OutputFormat and RecordWriter interfaces. It just allows the user to implement a storage function that controls how the data is serialized. For hadoop tables, we will need to allow custom OutputFormats that prepare output information and objects needed by a Table store function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
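The one-line change above swaps isInstance() for isAssignableFrom(). A self-contained illustration, using stand-in classes rather than Hadoop's actual OutputFormat, of why the original check could never succeed:

```java
public class ClassCheckDemo {
    static class OutputFormat {}                        // stand-in for the base type
    static class MyOutputFormat extends OutputFormat {} // user-supplied subclass

    public static void main(String[] args) {
        Class<?> sPrepClass = MyOutputFormat.class;

        // Wrong check: asks whether the Class *object* OutputFormat.class is an
        // instance of MyOutputFormat -- a Class object never is, so this is
        // always false regardless of what class the user supplied.
        System.out.println(sPrepClass.isInstance(OutputFormat.class));       // false

        // Right check: asks whether sPrepClass is OutputFormat itself or a
        // subclass of it -- true for any valid user-supplied OutputFormat.
        System.out.println(OutputFormat.class.isAssignableFrom(sPrepClass)); // true
    }
}
```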
[jira] Issue Comment Edited: (PIG-652) Need to give user control of OutputFormat
[ https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677603#action_12677603 ] pkamath edited comment on PIG-652 at 2/27/09 4:06 PM: - Attaching new patch - the only difference is (old code is the first input in the diff below):
{code}
--- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java (revision 747112)
+++ src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java (revision 748740)
146c146
< if(sPrepClass != null && sPrepClass.isInstance(OutputFormat.class)) {
---
> if(sPrepClass != null && OutputFormat.class.isAssignableFrom(sPrepClass)) {
{code}
The change checks whether the class supplied by the StoreFunc is of type OutputFormat. was (Author: pkamath): Attaching new patch - the only difference is (old code is the first input in the diff below):
{code}
--- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java (revision 747112)
+++ src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java (revision 748740)
146c146
< if(sPrepClass != null && sPrepClass.isInstance(OutputFormat.class)) {
---
> if(sPrepClass != null && OutputFormat.class.isAssignableFrom(sPrepClass)) {
{code}
The change checks whether the class supplied by the Loader is of type OutputFormat. Need to give user control of OutputFormat - Key: PIG-652 URL: https://issues.apache.org/jira/browse/PIG-652 Project: Pig Issue Type: New Feature Components: impl Affects Versions: types_branch Reporter: Alan Gates Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652-v4.patch, PIG-652.patch Pig currently allows users some control over InputFormat via the Slicer and Slice interfaces. It does not allow any control over OutputFormat and RecordWriter interfaces. 
It just allows the user to implement a storage function that controls how the data is serialized. For hadoop tables, we will need to allow custom OutputFormats that prepare output information and objects needed by a Table store function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-691) BinStorage skips tuples when ^A is present in data
[ https://issues.apache.org/jira/browse/PIG-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-691: --- Fix Version/s: types_branch Affects Version/s: types_branch Status: Patch Available (was: Open) BinStorage uses RECORD_1, RECORD_2 and RECORD_3 byte markers (the bytes 0x01, 0x02, 0x03) to mark the beginning of a new record. The current bug in BinStorage is that in getNext(), the code looks for RECORD_1 and if it finds RECORD_1, it looks for RECORD_2. If it fails to find RECORD_2, it goes back to look for the entire sequence, starting with looking for RECORD_1. However this fails when we have the following sequence: RECORD_1-RECORD_1-RECORD_2-RECORD_3. After reading the second RECORD_1 in the above sequence, we should not look for RECORD_1 again but start by looking for RECORD_2. This is an issue only when a record in BinStorage spans two blocks and the part in the head of the second block has the above sequence. This can happen when the last field in the record is null (null is represented by the byte 0x01 which is RECORD_1). The attached patch fixes this issue. BinStorage skips tuples when ^A is present in data -- Key: PIG-691 URL: https://issues.apache.org/jira/browse/PIG-691 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Olga Natkovich Assignee: Pradeep Kamath Fix For: types_branch Pradeep found a problem with BinStorage.getNext function that causes data loss. He is working on the fix -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
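The corrected scanning behavior described above can be sketched as a small state machine. This is a hypothetical illustration of handling the overlapping RECORD_1 prefix, not Pig's actual BinStorage code; the buggy version effectively reset to "no bytes matched" on any mismatch, losing a RECORD_1 that could start a new marker.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkerScan {
    static final int RECORD_1 = 0x01, RECORD_2 = 0x02, RECORD_3 = 0x03;

    // Scan for the RECORD_1,RECORD_2,RECORD_3 sequence. On a mismatch, a byte
    // equal to RECORD_1 may itself start a new candidate marker, so the state
    // machine re-enters the "one byte matched" state instead of resetting to 0.
    // Returns the offset where the marker starts, or -1 if not found.
    public static int findMarker(InputStream in) throws IOException {
        int pos = 0;     // total bytes consumed
        int matched = 0; // how many marker bytes matched so far
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (matched == 0) {
                if (b == RECORD_1) matched = 1;
            } else if (matched == 1) {
                if (b == RECORD_2) matched = 2;
                else matched = (b == RECORD_1) ? 1 : 0; // key fix: keep the RECORD_1
            } else { // matched == 2
                if (b == RECORD_3) return pos - 3;      // marker start offset
                matched = (b == RECORD_1) ? 1 : 0;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // The failing sequence from the description: RECORD_1,RECORD_1,RECORD_2,RECORD_3.
        // The marker is the last three bytes, so it starts at offset 1.
        byte[] data = {0x01, 0x01, 0x02, 0x03};
        System.out.println(findMarker(new ByteArrayInputStream(data))); // 1
    }
}
```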
[jira] Updated: (PIG-691) BinStorage skips tuples when ^A is present in data
[ https://issues.apache.org/jira/browse/PIG-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-691: --- Attachment: PIG-691.patch BinStorage skips tuples when ^A is present in data -- Key: PIG-691 URL: https://issues.apache.org/jira/browse/PIG-691 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Olga Natkovich Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-691.patch Pradeep found a problem with BinStorage.getNext function that causes data loss. He is working on the fix -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-690) UNION doesn't work in the latest code
[ https://issues.apache.org/jira/browse/PIG-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-690: --- Attachment: PIG-690.patch UNION doesn't work in the latest code - Key: PIG-690 URL: https://issues.apache.org/jira/browse/PIG-690 Project: Pig Issue Type: Bug Affects Versions: types_branch Environment: mapred mode; local mode has the same problem under Linux. Code is taken from trunk. Reporter: Amir Youssefi Assignee: Pradeep Kamath Fix For: types_branch Attachments: PIG-690.patch grunt> a = load 'tmp/f1' using BinStorage(); grunt> b = load 'tmp/f2' using BinStorage(); grunt> describe a; a: {int,chararray,int,{(int,chararray,chararray)}} grunt> describe b; b: {int,chararray,int,{(int,chararray,chararray)}} grunt> c = union a,b; grunt> describe c; 2009-02-27 11:51:46,012 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052: Cannot cast bag with schema bag({(int,chararray,chararray)}) to tuple with schema tuple Details at logfile: /homes/amiry/pig_1235735380348.log dump a and dump b work fine. Sample data provided to dev team in an e-mail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-655) Comparison of schemas of bincond operands is flawed
[ https://issues.apache.org/jira/browse/PIG-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678210#action_12678210 ] Pradeep Kamath commented on PIG-655: I will be reviewing this patch Comparison of schemas of bincond operands is flawed --- Key: PIG-655 URL: https://issues.apache.org/jira/browse/PIG-655 Project: Pig Issue Type: Bug Components: impl Affects Versions: types_branch Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: types_branch Attachments: PIG-655.patch The comparison of schemas of bincond is flawed. Instead of comparing the field schemas, the type checker is comparing the schemas. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-692) when running script file, automatically set up job name based on the file name
[ https://issues.apache.org/jira/browse/PIG-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678458#action_12678458 ] Pradeep Kamath commented on PIG-692: +1 for the change when running script file, automatically set up job name based on the file name -- Key: PIG-692 URL: https://issues.apache.org/jira/browse/PIG-692 Project: Pig Issue Type: Improvement Components: tools Affects Versions: types_branch Reporter: Vadim Zaliva Priority: Trivial Fix For: types_branch Attachments: pig-job-name.patch When running a Pig script from the command line like this: pig scriptfile, right now the default job name is used. It is convenient to have it automatically set based on the script name. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-577) outer join query looses name information
[ https://issues.apache.org/jira/browse/PIG-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-577. Resolution: Fixed outer join query looses name information Key: PIG-577 URL: https://issues.apache.org/jira/browse/PIG-577 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: types_branch Attachments: PIG-577.patch The following query: A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float); B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararray, contributions: float); C = COGROUP A BY name, B BY name; D = FOREACH C GENERATE group, flatten((IsEmpty(A) ? null : A)), flatten((IsEmpty(B) ? null : B)); describe D; E = FOREACH D GENERATE A::gpa, B::contributions; Give the following error: (Even though describe shows correct information: D: {group: chararray,A::name: chararray,A::age: int,A::gpa: float,B::name: chararray,B::age: int,B::registration: chararray,B::contributions: float} java.io.IOException: Invalid alias: A::gpa in {group: chararray,bytearray,bytearray} at org.apache.pig.PigServer.parseQuery(PigServer.java:298) at org.apache.pig.PigServer.registerQuery(PigServer.java:263) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Invalid alias: A::gpa in {group: chararray,bytearray,bytearray} at org.apache.pig.impl.logicalLayer.parser.QueryParser.AliasFieldOrSpec(QueryParser.java:5930) at org.apache.pig.impl.logicalLayer.parser.QueryParser.ColOrSpec(QueryParser.java:5788) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseEvalSpec(QueryParser.java:3974) at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.UnaryExpr(QueryParser.java:3871) at org.apache.pig.impl.logicalLayer.parser.QueryParser.CastExpr(QueryParser.java:3825) at org.apache.pig.impl.logicalLayer.parser.QueryParser.MultiplicativeExpr(QueryParser.java:3734) at org.apache.pig.impl.logicalLayer.parser.QueryParser.AdditiveExpr(QueryParser.java:3660) at org.apache.pig.impl.logicalLayer.parser.QueryParser.InfixExpr(QueryParser.java:3626) at org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItem(QueryParser.java:3552) at org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItemList(QueryParser.java:3462) at org.apache.pig.impl.logicalLayer.parser.QueryParser.GenerateStatement(QueryParser.java:3419) at org.apache.pig.impl.logicalLayer.parser.QueryParser.NestedBlock(QueryParser.java:2894) at org.apache.pig.impl.logicalLayer.parser.QueryParser.ForEachClause(QueryParser.java:2309) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:966) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:742) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:537) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:60) at org.apache.pig.PigServer.parseQuery(PigServer.java:295) ... 6 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-577) outer join query looses name information
[ https://issues.apache.org/jira/browse/PIG-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-577: --- Hadoop Flags: [Reviewed] +1, Patch committed - thanks for the fix Santhosh. outer join query looses name information Key: PIG-577 URL: https://issues.apache.org/jira/browse/PIG-577 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Olga Natkovich Assignee: Santhosh Srinivasan Fix For: types_branch Attachments: PIG-577.patch The following query: A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float); B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararray, contributions: float); C = COGROUP A BY name, B BY name; D = FOREACH C GENERATE group, flatten((IsEmpty(A) ? null : A)), flatten((IsEmpty(B) ? null : B)); describe D; E = FOREACH D GENERATE A::gpa, B::contributions; Give the following error: (Even though describe shows correct information: D: {group: chararray,A::name: chararray,A::age: int,A::gpa: float,B::name: chararray,B::age: int,B::registration: chararray,B::contributions: float} java.io.IOException: Invalid alias: A::gpa in {group: chararray,bytearray,bytearray} at org.apache.pig.PigServer.parseQuery(PigServer.java:298) at org.apache.pig.PigServer.registerQuery(PigServer.java:263) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Invalid alias: A::gpa in {group: chararray,bytearray,bytearray} at org.apache.pig.impl.logicalLayer.parser.QueryParser.AliasFieldOrSpec(QueryParser.java:5930) at org.apache.pig.impl.logicalLayer.parser.QueryParser.ColOrSpec(QueryParser.java:5788) at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseEvalSpec(QueryParser.java:3974) at org.apache.pig.impl.logicalLayer.parser.QueryParser.UnaryExpr(QueryParser.java:3871) at org.apache.pig.impl.logicalLayer.parser.QueryParser.CastExpr(QueryParser.java:3825) at org.apache.pig.impl.logicalLayer.parser.QueryParser.MultiplicativeExpr(QueryParser.java:3734) at org.apache.pig.impl.logicalLayer.parser.QueryParser.AdditiveExpr(QueryParser.java:3660) at org.apache.pig.impl.logicalLayer.parser.QueryParser.InfixExpr(QueryParser.java:3626) at org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItem(QueryParser.java:3552) at org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItemList(QueryParser.java:3462) at org.apache.pig.impl.logicalLayer.parser.QueryParser.GenerateStatement(QueryParser.java:3419) at org.apache.pig.impl.logicalLayer.parser.QueryParser.NestedBlock(QueryParser.java:2894) at org.apache.pig.impl.logicalLayer.parser.QueryParser.ForEachClause(QueryParser.java:2309) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:966) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:742) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:537) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:60) at org.apache.pig.PigServer.parseQuery(PigServer.java:295) ... 6 more -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12679021#action_12679021 ] Pradeep Kamath commented on PIG-627: I committed multi-store-0304.patch into the multi-query branch after reviewing the changes. PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: types_branch Reporter: Olga Natkovich Fix For: types_branch Attachments: multi-store-0303.patch, multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
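The intended optimization can be illustrated with a small Python sketch (a hypothetical in-memory model, not Pig code): one pass over the input evaluates the filter once and feeds both store destinations, which is the effect of inserting a split instead of running two independent jobs.

```python
from collections import defaultdict

def run_shared(records):
    """One pass over the input: the filter (a > 5) is evaluated once and
    its output feeds both consumers, mimicking a split operator instead
    of two jobs that each re-read and re-filter the data."""
    output1 = []                    # store B into 'output1'
    groups = defaultdict(list)      # C = group B by b
    for rec in records:
        a, b, c = rec
        if a > 5:                   # B = filter A by a > 5 (computed once)
            output1.append(rec)
            groups[b].append(rec)
    output2 = list(groups.items())  # store C into 'output2'
    return output1, output2
```

The unoptimized plan would run this loop twice, once per store, reading and filtering the same input both times.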
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680724#action_12680724 ] Pradeep Kamath commented on PIG-627: multiquery_0306.patch seems to have a lot of code from the earlier patch (multi-store-0304.patch). Richard, can you svn up your code base and regenerate the patch with only the changes you intended? PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680997#action_12680997 ] Pradeep Kamath commented on PIG-627: Sorry about the misunderstanding, I think I looked at a different patch. After reviewing the right patch, here are some comments: The patch throws Java Exceptions like IllegalStateException. This should be replaced with the appropriate Exception class (like MRCompilerException) as specified in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification. The exception should be created with the error code, error source and error message constructor. New error codes should be introduced if one of the existing ones in http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification#head-9f71d78d362c3307711f98ec9db3ee12b55e92f6 cannot be used. If new codes are introduced, the wiki table should be updated. The following can be used to check for file existence in BinStorage.determineSchema() - only in the case where the file does not exist, null should be returned {code} public static boolean fileExists(String filename, DataStorage store) throws IOException { ElementDescriptor elem = store.asElement(filename); return elem.exists() || globMatchesFiles(elem, store); } {code} Instead of introducing a rootsFirst attribute in DependencyOrderWalker, I wonder if we should have a ReverseDependencyOrderWalker since that is what the rootsFirst == false case will be. If we are not visiting from roots to leaves, we really are not visiting in a dependency order - so the meaning of dependency order is no longer honored - this can be confusing I think. By explicitly naming the walker ReverseDependencyOrderWalker, the intent of walking from leaves to roots is clearer I think. In POSplit currently there is a PhysicalPlan representing the merged inner plans (where all plans are mutually exclusive) and there is also a List<PhysicalPlan> which has the same information in the form of a list.
In the rest of the Pig code, inner plans have always been modelled as List<PhysicalPlan>. For consistency, it is better to just have a List<PhysicalPlan> to represent the inner plans. PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681085#action_12681085 ] Pradeep Kamath commented on PIG-627: Committed patch per previous comment that the review comments will be addressed in the next patch - thanks Richard for the contribution. In general from Pig code we always want to throw known PigExceptions even for programming errors or internal state errors - in these cases, we just use the source of the Exception as PigException.BUG. RuntimeException should be used when we want to throw an exception in a function which cannot throw any exceptions (such as methods from the Hadoop API which we are implementing that do not declare any exceptions) PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688339#action_12688339 ] Pradeep Kamath commented on PIG-627: Comments for Richard's patch - multiquery-phase2_0313.patch In MultiQueryOptimizer: - what about mr not being map only and with mr splittee? - is this not handled for now? - Is the single mapper case and the single map-reduce case when the script has an explicit store 'file' and load 'file' - if this is so, then in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(), the store is removed - shouldn't the store remain? - There is common code in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() which should be moved to a function to reduce the code duplication. Just want to confirm that the multi query optimization is only for map reduce mode - since the optimizer is being called in MapReduceLauncher In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I noticed that in POSplit, it causes an exception - I think it should return the error which would later be caught in the map() or reduce() - a test to make sure errors do get caught and cause failures would be good. spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyOrderWalker. The following comment in BinStorage needs to be clarified: {noformat} if (!FileLocalizer.fileExists(fileName, storage)) { // At compile time in batch mode, the file may not exist // (such as intermediate file). Just return null - the // same way as we could's get a valid record from the input. -- does this actually mean the same way as we would if we did not get a valid record ?
return null; } {noformat} PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
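The ReverseDependencyOrderWalker suggested in the comments above can be sketched generically in Python (an illustrative topological-sort walk under assumed names, not the actual Pig walker API): a dependency-order walk visits every node before its successors, and the reverse walk visits the same nodes in the opposite order, leaves first.

```python
def dependency_order(nodes, edges):
    """Roots-first order: every node is visited before its successors.
    `edges` maps a node to the list of its successors (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for s in edges.get(n, []):
            indegree[s] += 1
    ready = [n for n in nodes if indegree[n] == 0]  # the roots
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for s in edges.get(n, []):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

def reverse_dependency_order(nodes, edges):
    """Leaves-first order: every node is visited before its predecessors."""
    return list(reversed(dependency_order(nodes, edges)))
```

Naming the leaves-first variant explicitly, rather than adding a rootsFirst flag, keeps "dependency order" meaning exactly one thing.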
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688356#action_12688356 ] Pradeep Kamath commented on PIG-627: +1 on Gunther's patch - multiquery_explain_fix.patch. Patch has been committed to the multiquery branch - thanks for the fix Gunther! PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688461#action_12688461 ] Pradeep Kamath commented on PIG-627: +1 on Richard's patch - multiquery-phase2_0323.patch, patch committed to multiquery branch - thanks for the contribution Richard. A general comment for the multiquery work is to introduce some negative test cases (which return POStatus.STATUS_ERR from some operator in the map or reduce plan affected by the multiQueryOptimizer) at some point. PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-729) Use of default parallelism
[ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688755#action_12688755 ] Pradeep Kamath commented on PIG-729: Another option may be to detect map-reduce boundaries in the script which do not have a parallel specification and prompt the user to input a parallel number they want to use for all such map-reduce boundaries (default being 1). This way users are given an opportunity at submit time to specify parallelism if they forgot to do so in the script. Use of default parallelism -- Key: PIG-729 URL: https://issues.apache.org/jira/browse/PIG-729 Project: Pig Issue Type: Bug Components: impl Affects Versions: 1.0.1 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Fix For: 1.0.1 Currently, if the user does not specify the number of reduce slots using the parallel keyword, Pig lets Hadoop decide on the default number of reducers. This model worked well with dynamically allocated clusters using HOD and for static clusters where the default number of reduce slots was explicitly set. With Hadoop 0.20, a single static cluster will be shared amongst a number of queues. As a result, a common scenario is to end up with the default number of reducers set to one (1). When users migrate to Hadoop 0.20, they might see a dramatic change in the performance of their queries if they had not used the parallel keyword to specify the number of reducers. In order to mitigate such circumstances, Pig can support one of the following: 1. Specify a default parallelism for the entire script. This option will allow users to use the same parallelism for all operators that do not have the explicit parallel keyword. This will ensure that the scripts utilize more reducers than the default of one reducer.
On the down side, due to data transformations, usually operations that are performed towards the end of the script will need a smaller number of reducers compared to the operators that appear at the beginning of the script. 2. Display a warning message for each reduce-side operator that does not have the explicit parallel keyword. Proceed with the execution. 3. Display an error message indicating the operator that does not have the explicit use of the parallel keyword. Stop the execution. Other suggestions/thoughts/solutions are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688957#action_12688957 ] Pradeep Kamath commented on PIG-627: +1 - committed patch by Gunther to merge changes in trunk to multiquery branch - thanks for the contribution Gunther. PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input
Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input --- Key: PIG-733 URL: https://issues.apache.org/jira/browse/PIG-733 Project: Pig Issue Type: Bug Reporter: Pradeep Kamath Assignee: Pradeep Kamath Order by has a sampling job which samples the input and creates a sorted list of sample items. Currently the number of items sampled is 100 per map task. So if the input is large, resulting in many maps (say 50,000), the sample is big. This sorted sample is stored on dfs. The WeightedRangePartitioner computes quantile boundaries and weighted probabilities for repeating values in each map by reading the samples file from DFS. In queries with many maps (on the order of 50,000) the dfs read of the sample file fails with a FileSystem closed error. This seems to point to a dfs issue wherein a big dfs file being read simultaneously by many dfs clients (in this case all maps) causes the clients to be closed. However, on the Pig side, loading the sample from each map in the final map reduce job and computing the quantile boundaries and weighted probabilities is inefficient. We should do this computation through a FindQuantiles udf in the same map reduce job which produces the sorted samples. This way less data is written to dfs and, in the final map reduce job, the WeightedRangePartitioner just needs to load the computed information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
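The boundary computation proposed above can be sketched as follows. This is a simplified, hypothetical Python illustration of picking reducer boundary keys from the sorted sample, not the actual FindQuantiles udf, and it ignores the weighted-probability handling for repeated keys:

```python
def find_quantiles(sorted_samples, num_reducers):
    """Given the sorted sample, pick num_reducers - 1 boundary keys that
    split the key space into roughly equal parts. The final job's
    partitioner then only has to load these few boundaries instead of
    re-reading the entire sample file from dfs in every map."""
    n = len(sorted_samples)
    return [sorted_samples[(i * n) // num_reducers]
            for i in range(1, num_reducers)]
```

For a uniform sample of 100 keys and 4 reducers this yields 3 boundaries, so only 3 values (instead of 100 per map task) need to be written and re-read.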
[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input
[ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-733: --- Fix Version/s: 0.3.0 Affects Version/s: 0.2.0 Status: Patch Available (was: Open) Attached a patch which implements the fix described in the description of the issue Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input --- Key: PIG-733 URL: https://issues.apache.org/jira/browse/PIG-733 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Order by has a sampling job which samples the input and creates a sorted list of sample items. Currently the number of items sampled is 100 per map task. So if the input is large, resulting in many maps (say 50,000), the sample is big. This sorted sample is stored on dfs. The WeightedRangePartitioner computes quantile boundaries and weighted probabilities for repeating values in each map by reading the samples file from DFS. In queries with many maps (on the order of 50,000) the dfs read of the sample file fails with a FileSystem closed error. This seems to point to a dfs issue wherein a big dfs file being read simultaneously by many dfs clients (in this case all maps) causes the clients to be closed. However, on the Pig side, loading the sample from each map in the final map reduce job and computing the quantile boundaries and weighted probabilities is inefficient. We should do this computation through a FindQuantiles udf in the same map reduce job which produces the sorted samples. This way less data is written to dfs and, in the final map reduce job, the WeightedRangePartitioner just needs to load the computed information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input
[ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-733: --- Attachment: PIG-733.patch Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input --- Key: PIG-733 URL: https://issues.apache.org/jira/browse/PIG-733 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-733.patch
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694859#action_12694859 ] Pradeep Kamath commented on PIG-627: +1, patch committed - thanks for the contribution Gunther. PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1 followed by a map-reduce job that generates output2. As a result the data is read, parsed and filtered twice, which is unnecessary and costly.
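The duplicated work the issue describes can be illustrated with a toy sketch. This is plain Python standing in for the Pig jobs; load_data, naive_plan, and multi_query_plan are hypothetical helpers invented for the illustration, not Pig APIs.

```python
# Count how many times the input is (re)read, mimicking the two Pig plans:
# the naive plan runs two independent jobs over the same input, while the
# multi-query plan reads and filters once and feeds both consumers.
reads = {"count": 0}

def load_data():
    """Stand-in for load 'data' as (a, b, c)."""
    reads["count"] += 1
    return [(1, 2, 'x'), (7, 3, 'y'), (9, 3, 'z')]

def naive_plan():
    out1 = [r for r in load_data() if r[0] > 5]        # store B into 'output1'
    grouped = {}
    for r in load_data():                              # input read AGAIN
        if r[0] > 5:                                   # ... and filtered again
            grouped.setdefault(r[1], []).append(r)     # C = group B by b
    return out1, grouped

def multi_query_plan():
    filtered = [r for r in load_data() if r[0] > 5]    # shared work done once
    out1 = list(filtered)
    grouped = {}
    for r in filtered:
        grouped.setdefault(r[1], []).append(r)
    return out1, grouped
```

Both plans produce identical outputs; the multi-query plan simply touches the input once instead of twice, which is the saving the optimization targets.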
[jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input
[ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696244#action_12696244 ] Pradeep Kamath commented on PIG-733: Tests are not included in this patch since there are existing tests for order by. All core unit tests did pass and findbugs gave the same number of warnings with and without the patch (output below). The excess warnings produced by the patch have been addressed in the new version of the patch (PIG-733-v2.patch). {noformat} === CORE UNIT TESTS OUTPUT WITH PATCH [prade...@afterside:/tmp/PIG-733/trunk] test-core: [mkdir] Created dir: /tmp/PIG-733/trunk/build/test/logs [junit] Running org.apache.pig.test.TestAdd [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.056 sec ... [junit] Running org.apache.pig.test.TestTypeCheckingValidatorNoSchema [junit] Tests run: 13, Failures: 0, Errors: 0, Time elapsed: 0.629 sec [junit] Running org.apache.pig.test.TestUnion [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 49.94 sec test-contrib: BUILD SUCCESSFUL Total time: 77 minutes 47 seconds === FINDBUGS OUTPUT WITH PATCH [prade...@afterside:/tmp/PIG-733/trunk] [prade...@chargesize:/tmp/PIG-733/trunk]ant -Dfindbugs.home=/homes/pradeepk/findbugs-1.3.8 findbugs Buildfile: build.xml ... findbugs: [mkdir] Created dir: /tmp/PIG-733/trunk/build/test/findbugs [findbugs] Executing findbugs from ant task [findbugs] Running FindBugs... [findbugs] Warnings generated: 665 [findbugs] Calculating exit code...
[findbugs] Setting 'bugs found' flag (1) [findbugs] Exit code set to: 1 [findbugs] Java Result: 1 [findbugs] Output saved to /tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.xml [xslt] Processing /tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.xml to /tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.html [xslt] Loading stylesheet /homes/pradeepk/findbugs-1.3.8/src/xsl/default.xsl === FINDBUGS OUTPUT WITHOUT PATCH [prade...@chargesize:/tmp/svncheckout/trunk]ant -Dfindbugs.home=/homes/pradeepk/findbugs-1.3.8 findbugs Buildfile: build.xml check-for-findbugs: ... findbugs: [mkdir] Created dir: /tmp/svncheckout/trunk/build/test/findbugs [findbugs] Executing findbugs from ant task [findbugs] Running FindBugs... [findbugs] Warnings generated: 665 [findbugs] Calculating exit code... [findbugs] Setting 'bugs found' flag (1) [findbugs] Exit code set to: 1 [findbugs] Java Result: 1 [findbugs] Output saved to /tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.xml [xslt] Processing /tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.xml to /tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.html [xslt] Loading stylesheet /homes/pradeepk/findbugs-1.3.8/src/xsl/default.xsl {noformat} Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input --- Key: PIG-733 URL: https://issues.apache.org/jira/browse/PIG-733 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-733.patch
[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input
[ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-733: --- Attachment: PIG-733-v2.patch Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input --- Key: PIG-733 URL: https://issues.apache.org/jira/browse/PIG-733 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-733-v2.patch, PIG-733.patch
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696350#action_12696350 ] Pradeep Kamath commented on PIG-627: +1, patch committed. Thanks for the contribution Gunther! PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: file_cmds-0305.patch, fix_store_prob.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch
[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input
[ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-733: --- Resolution: Fixed Status: Resolved (was: Patch Available) Patch committed. Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input --- Key: PIG-733 URL: https://issues.apache.org/jira/browse/PIG-733 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-733-v2.patch, PIG-733.patch
[jira] Resolved: (PIG-739) Filter in foreach seems to drop records resulting in decreased count of records
[ https://issues.apache.org/jira/browse/PIG-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-739. Resolution: Duplicate Assignee: Pradeep Kamath This issue has the same root cause as PIG-514 - hence marking this as a duplicate - the fix for this issue will also be tracked in PIG-514 Filter in foreach seems to drop records resulting in decreased count of records --- Key: PIG-739 URL: https://issues.apache.org/jira/browse/PIG-739 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: filter_distinctbug.pig, testdata I have a Pig script in which I count the number of distinct records resulting from the filter; this statement is embedded in a foreach. The number of records I get with alias TESTDATA_AGG_2 is 1. {code} TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int); TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '123080040' and timestamp lt '123080400' and value != 0); TESTDATA_GROUP = group TESTDATA_FILTERED by testid; TESTDATA_AGG = foreach TESTDATA_GROUP { A = filter TESTDATA_FILTERED by (userid eq sessionid); C = distinct A.userid; generate group as testid, COUNT(TESTDATA_FILTERED) as counttestdata, COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as total_flags; } TESTDATA_AGG_1 = group TESTDATA_AGG ALL; -- count records generated through nested foreach which contains distinct TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG); --explain TESTDATA_AGG_2; dump TESTDATA_AGG_2; --RESULT (1L) {code} But when I do the counting of records without the filter and distinct in the foreach I get a different value (20L) {code} TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int); TESTDATA_FILTERED = filter TESTDATA by
(timestamp gte '123080040' and timestamp lt '123080400' and value != 0); TESTDATA_GROUP = group TESTDATA_FILTERED by testid; -- count records generated through simple foreach TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as total_flags; TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL; TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2); dump TESTDATA_AGG2_2; --RESULT (20L) {code} Attaching testdata
[jira] Commented: (PIG-514) COUNT returns no results as a result of two filter statements in FOREACH
[ https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699801#action_12699801 ] Pradeep Kamath commented on PIG-514: I am currently working on implementing the above proposal since I have not seen any objections. After making the core changes to implement the above proposal, I validated that it fixed the issue reported here and also in PIG-739 and PIG-710. I need to add a few more changes to make the patch complete - will supply a patch once done. COUNT returns no results as a result of two filter statements in FOREACH Key: PIG-514 URL: https://issues.apache.org/jira/browse/PIG-514 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Assignee: Pradeep Kamath Attachments: mystudentfile.txt The following piece of sample code in FOREACH, which counts the filtered student records based on record_type == 1 and scores and also on record_type == 0, does not seem to return any results. {code} mydata = LOAD 'mystudentfile.txt' AS (record_type,name,age,scores,gpa); --keep only what we need mydata_filtered = FOREACH mydata GENERATE record_type, name, age, scores ; --group mydata_grouped = GROUP mydata_filtered BY (record_type,age); myfinaldata = FOREACH mydata_grouped { myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores; myfilter2 = FILTER mydata_filtered BY record_type == 0; GENERATE FLATTEN(group), -- Only this count causes the problem ?? COUNT(myfilter1) as col2, SUM(myfilter2.scores) as col3, COUNT(myfilter2) as col4; }; --these set of statements confirm that the count on the filters returns 1 --mycountdata = FOREACH mydata_grouped --{ -- myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores; -- GENERATE -- COUNT(myfilter1) as colcount; --}; --dump mycountdata; dump myfinaldata; {code} But if you uncomment the {code} COUNT(myfilter1) as col2, {code}, it seems to work with the following results:
(0,22,45.0,2L) (0,24,133.0,6L) (0,25,22.0,1L) Also I have tried to verify if this is an issue with the {code} COUNT(myfilter1) as col2, {code} returning zero. It does not seem to be the case. If {code} dump mycountdata; {code} is uncommented it returns: (1L) (1L) I am attaching the tab separated 'mystudentfile.txt' file used in this Pig script. Is this an issue with 2 filters in the FOREACH followed by a COUNT on these filters??
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700925#action_12700925 ] Pradeep Kamath commented on PIG-627: reviewed error_handling_0416.patch for additional changes per comment: https://issues.apache.org/jira/browse/PIG-627?focusedCommentId=1260page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_1260. +1, committed after removing the javadoc related changes which were already committed in the previous commit. PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: doc-fix.patch, error_handling_0415.patch, error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch
[jira] Created: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported
Empty complex constants (empty bag, empty tuple and empty map) should be supported -- Key: PIG-773 URL: https://issues.apache.org/jira/browse/PIG-773 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Priority: Minor We should be able to create an empty bag constant using {}, an empty tuple constant using (), and an empty map constant using [] within a pig script
[jira] Resolved: (PIG-514) COUNT returns no results as a result of two filter statements in FOREACH
[ https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-514. Resolution: Fixed Fix Version/s: 0.3.0 Hadoop Flags: [Reviewed] Patch committed with the change in previous comment. COUNT returns no results as a result of two filter statements in FOREACH Key: PIG-514 URL: https://issues.apache.org/jira/browse/PIG-514 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: mystudentfile.txt, PIG-514.patch
[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702005#action_12702005 ] Pradeep Kamath commented on PIG-627: All the work till now (phase 1 and phase 2) has now been committed to trunk. A tag (pre-multiquery-phase2) was created prior to committing the multi query work since this is a significantly big patch. The tag will serve as a baseline to trace down regressions. PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: doc-fix.patch, error_handling_0415.patch, error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch
[jira] Resolved: (PIG-775) PORelationToExprProject should create a NonSpillableDataBag to create empty bags
[ https://issues.apache.org/jira/browse/PIG-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath resolved PIG-775. Resolution: Fixed Patch committed. PORelationToExprProject should create a NonSpillableDataBag to create empty bags Key: PIG-775 URL: https://issues.apache.org/jira/browse/PIG-775 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Priority: Minor Fix For: 0.3.0 Attachments: PIG-775.patch PORelationToExprProject currently uses BagFactory.newDefaultBag() to create an empty bag in cases where it has to send an empty bag on EOP - each such empty bag created will be registered with the SpillableMemoryManager as a spillable bag. Since it is an empty bag, it really should not be registered as a spillable bag. For this, NonSpillableDataBag can be used.
[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY
[ https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707064#action_12707064 ] Pradeep Kamath commented on PIG-802: PIG-744 is a duplicate - will be marking that one as a duplicate. Pasting the summary from PIG-744, which has a little more detail: Currently order by results in multiple map reduce jobs (2 or 3 depending on the script), of which the last one does the actual ordering. In this last map reduce job, we create a bag of values (each value being the entire tuple that is getting sorted) for each sort key using POPackage in the reduce phase. Then we turn around and flatten the bag in the foreach following the package, so there is really no need for the bag. To stay generic and use the existing operators, we can be more efficient by tagging the POPackage to create bags which are backed by the Hadoop iterator itself. This way we do not create a bag by making a copy of each tuple from the hadoop iterator. This should help both performance and scalability by making better use of memory. PERFORMANCE: not creating bags for ORDER BY --- Key: PIG-802 URL: https://issues.apache.org/jira/browse/PIG-802 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join.
[jira] Created: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator) Key: PIG-807 URL: https://issues.apache.org/jira/browse/PIG-807 Project: Pig Issue Type: Improvement Affects Versions: 0.2.1 Reporter: Pradeep Kamath Fix For: 0.3.0 Currently all bags resulting from a group or cogroup are materialized as bags containing all of the contents. The issue with this is that if a particular key has many corresponding values, all these values get stuffed into a bag which may run out of memory and hence spill, causing a slowdown in performance and sometimes memory exceptions. In many cases, the udfs which use these bags coming out of a group or cogroup only need to iterate over the bag in a unidirectional read-once manner. This can be implemented by having the bag implement its iterator by simply iterating over the underlying hadoop iterator provided in the reduce. This kind of a bag is also needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for this issue too. The other part of this issue is to have some way for the udfs to communicate to Pig that any input bags they need are read-once bags. This can be achieved by having an interface - say UsesReadOnceBags - which serves as a tag to indicate the intent to Pig. Pig can then rewire its execution plan to use ReadOnceBags where feasible.
[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY
[ https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708551#action_12708551 ] Pradeep Kamath commented on PIG-802: Adding some more details: A new kind of bag - ReadOnceBag - needs to be implemented. This bag will have a reference to the key currently being processed and the iterator over values provided by hadoop in reduce(). The ReadOnceBag's iterator will simply iterate over the hadoop iterator at each call and construct a tuple by using the key and value (see POPackage.java for details on how this is done). POPackage should also be changed, or a new class introduced, which creates ReadOnceBags instead of regular bags. This creation of the bag should only initialize the bag with the key and iterator. PERFORMANCE: not creating bags for ORDER BY --- Key: PIG-802 URL: https://issues.apache.org/jira/browse/PIG-802 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich
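The ReadOnceBag design described in the comment above can be sketched in a few lines. This is an illustrative Python sketch of the idea, not the eventual Java implementation in Pig; the class name matches the proposal but the constructor and error behavior are assumptions made for the example.

```python
class ReadOnceBag:
    """A bag backed directly by the reduce-side values iterator.

    Tuples are materialized one at a time from the underlying iterator, so
    the whole bag never has to fit in memory - but, unlike a regular bag,
    it can only be iterated once.
    """
    def __init__(self, key, values_iter):
        self.key = key
        self._values_iter = values_iter
        self._consumed = False

    def __iter__(self):
        if self._consumed:
            raise RuntimeError("ReadOnceBag supports a single pass only")
        self._consumed = True
        # Reconstruct each full tuple from the key and the streamed value,
        # analogous to what POPackage does when it builds regular bags.
        return ((self.key, value) for value in self._values_iter)
```

A UDF that only makes one forward pass (COUNT, SUM, and the flatten done by order by) works unchanged on such a bag; a second iteration attempt fails fast instead of silently returning nothing.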
[jira] Updated: (PIG-804) problem with lineage with double map redirection
[ https://issues.apache.org/jira/browse/PIG-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-804: --- Fix Version/s: 0.3.0 Affects Version/s: 0.2.1 Status: Patch Available (was: Open) problem with lineage with double map redirection Key: PIG-804 URL: https://issues.apache.org/jira/browse/PIG-804 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Olga Natkovich Assignee: Pradeep Kamath Fix For: 0.3.0 v1 = load 'data' as (s,m,l); v2 = foreach v1 GENERATE s#'src_spaceid' AS vspaceid ; v3 = foreach v2 GENERATE (chararray)vspaceid#'foo'; explain v3; The last cast does not have a loader associated with it and as a result the script fails on the backend with the following error: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
[jira] Updated: (PIG-804) problem with lineage with double map redirection
[ https://issues.apache.org/jira/browse/PIG-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-804: --- Attachment: PIG-804.patch The root cause was in the parser: in CastExp(), getFieldSchema() was being called on the target operand of the cast to get the alias. This had the side effect of setting up lineage information (i.e. the canonical map in the operand). This point in the code is too early for setting up lineage information, since operators may be added/removed later on due to optimizations. This should be done at a later, safe point (this change will be tracked in PIG-808). As a fix for now, unsetFieldSchema() is called to unset the lineage information. problem with lineage with double map redirection Key: PIG-804 URL: https://issues.apache.org/jira/browse/PIG-804 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Olga Natkovich Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-804.patch v1 = load 'data' as (s,m,l); v2 = foreach v1 GENERATE s#'src_spaceid' AS vspaceid ; v3 = foreach v2 GENERATE (chararray)vspaceid#'foo'; explain v3; The last cast does not have a loader associated with it and as a result the script fails on the backend with the following error: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-808) getFieldSchema() in ExpressionOperators also sets up lineage information - this can cause issues if getFieldSchema() is called too early
getFieldSchema() in ExpressionOperators also sets up lineage information - this can cause issues if getFieldSchema() is called too early Key: PIG-808 URL: https://issues.apache.org/jira/browse/PIG-808 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath Fix For: 0.3.0 See PIG-804 for a use case which exposes this bug. We should probably be setting up lineage information outside getFieldSchema() through a visitor at a point where we know it is safe - (just before TypeCheckingVisitor?). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY
[ https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711769#action_12711769 ] Pradeep Kamath commented on PIG-802: Review comments: In MRCompiler, does POPackageLite need to be used in the following too: {noformat} if (limit!=-1) { POPackage pkg_c = new POPackage(new OperatorKey(scope,nig.getNextNodeId(scope))); ... } {noformat} In POPackage, the following declarations: {noformat} Iterator<NullableTuple> tupIter; Object key; {noformat} should have the protected access specifier, to make explicit the intent that they are used in POPackageLite. In ReadOnceBag.equals() you could also check if the keyInfo maps are equal. The getValueTuple() in ReadOnceBag had duplicate code from POPackage.getValueTuple(). Instead of having the same code in two places, I am wondering if you could just construct ReadOnceBag with a POPackageLite instance passed in the constructor. Then if you make the POPackageLite.getValueTuple() method public, you can just invoke it from ReadOnceBag code. This way the code remains in one place. PERFORMANCE: not creating bags for ORDER BY --- Key: PIG-802 URL: https://issues.apache.org/jira/browse/PIG-802 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: OrderByOptimization.patch Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
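The de-duplication suggested in the last review point - passing the packaging operator into the bag's constructor and delegating tuple construction to it - is ordinary constructor injection. A toy sketch with stand-in classes (names and signatures are illustrative, not Pig's actual code):

```java
import java.util.*;

// Stand-in for POPackageLite: owns the single copy of getValueTuple().
class LitePackager {
    public List<Object> getValueTuple(Object key, Object value) {
        return Arrays.asList(key, value); // simplified tuple construction
    }
}

// Stand-in for ReadOnceBag: delegates tuple construction to the packager
// passed in its constructor instead of duplicating the logic.
class DelegatingBag {
    private final LitePackager pkg;
    private final Object key;
    private final Iterator<Object> values;

    DelegatingBag(LitePackager pkg, Object key, Iterator<Object> values) {
        this.pkg = pkg;
        this.key = key;
        this.values = values;
    }

    boolean hasNext() { return values.hasNext(); }

    List<Object> nextTuple() { return pkg.getValueTuple(key, values.next()); }
}
```

With this shape, a future change to how tuples are assembled from (key, value) touches only one class.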
[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY
[ https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711811#action_12711811 ] Pradeep Kamath commented on PIG-802: I think even in the future if ReadOnceBags are used in places other than order by, they would need to be used immediately after a POPackageLite. So tying the two together is not bad and would reduce code duplication. PERFORMANCE: not creating bags for ORDER BY --- Key: PIG-802 URL: https://issues.apache.org/jira/browse/PIG-802 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: OrderByOptimization.patch Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-814) Make BinStorage more robust when data contains record markers
Make BinStorage more robust when data contains record markers - Key: PIG-814 URL: https://issues.apache.org/jira/browse/PIG-814 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 When the input stream for BinStorage is at a position where the data contains the record marker sequence, the code incorrectly assumes that it is at the beginning of a record (tuple) and calls DataReaderWriter.readDatum() trying to read the tuple. The problem is more likely when RandomSampleLoader (used in the order by implementation) skips through the input stream for sampling and calls BinStorage.getNext(). The code should be more robust in such cases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
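A small sketch of why a raw byte-sequence match is insufficient: if record payload bytes happen to contain the marker sequence, a naive scan resynchronizes at the wrong offset. The marker value and scanning code below are illustrative, not BinStorage's actual marker or implementation:

```java
// Naive marker scan: returns the first offset at or after 'from' where the
// marker byte sequence occurs - whether or not it is a real record boundary.
class MarkerScan {
    static int findMarker(byte[] data, byte[] marker, int from) {
        outer:
        for (int i = from; i <= data.length - marker.length; i++) {
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) continue outer;
            }
            return i; // may be a false positive inside record payload
        }
        return -1;
    }
}
```

A more robust loader would treat a marker hit only as a candidate boundary, attempt to parse a record there, and on failure resume scanning from the next byte.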
[jira] Updated: (PIG-804) problem with lineage with double map redirection
[ https://issues.apache.org/jira/browse/PIG-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-804: --- Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch was committed on May 13 2009. problem with lineage with double map redirection Key: PIG-804 URL: https://issues.apache.org/jira/browse/PIG-804 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Olga Natkovich Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-804.patch v1 = load 'data' as (s,m,l); v2 = foreach v1 GENERATE s#'src_spaceid' AS vspaceid ; v3 = foreach v2 GENERATE (chararray)vspaceid#'foo'; explain v3; The last cast does not have a loader associated with it and as a result the script fails on the backend with the following error: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY
[ https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12713181#action_12713181 ] Pradeep Kamath commented on PIG-802: Changes look good - still have a comment about the change in MRCompiler.java: In MRCompiler, does POPackageLite need to be used in the following too: {noformat} if (limit!=-1) { POPackage pkg_c = new POPackage(new OperatorKey(scope,nig.getNextNodeId(scope))); ... } {noformat} PERFORMANCE: not creating bags for ORDER BY --- Key: PIG-802 URL: https://issues.apache.org/jira/browse/PIG-802 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: OrderByOptimization.patch Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-816) PigStorage() does not accept Unicode characters in its constructor
[ https://issues.apache.org/jira/browse/PIG-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-816: --- Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed. PigStorage() does not accept Unicode characters in its constructor -- Key: PIG-816 URL: https://issues.apache.org/jira/browse/PIG-816 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Assignee: Pradeep Kamath Priority: Critical Fix For: 0.3.0 Attachments: PIG-816.patch, pig_1243043613713.log A simple Pig script which uses Unicode characters in the PigStorage() constructor fails with the following error: {code} studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, age:int,gpa:float); X2 = GROUP studenttab by age; Y2 = FOREACH X2 GENERATE group, COUNT(studenttab); store Y2 into '/user/viraj/y2' using PigStorage('\u0001'); {code} ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backend error: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference #1 is an invalid XML character. Attaching log file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
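The SAXParseException arises because U+0001 falls outside the Char production of the XML 1.0 specification (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]), and Hadoop serializes the job configuration as XML. A direct check of that grammar rule:

```java
class Xml10 {
    // True iff the code point is a legal XML 1.0 character per the spec's
    // Char production: tab, LF, CR, and printable ranges only - most C0
    // control characters (including U+0001) are excluded.
    static boolean isValidChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }
}
```

This is why the error surfaces only at job submission: the delimiter is accepted by the parser but cannot survive the round trip through the XML job configuration.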
[jira] Commented: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715325#action_12715325 ] Pradeep Kamath commented on PIG-796: A few comments: - In TestPOCast.java the variables can be named as something like opWithInputTypeAsByteArray for the POCast objects since the intent is not so clear with the current names - In POCast.java you can check for the realType inside the catch clause rather than before trying to cast to ByteArray. This way, if the cast to ByteArray is always successful, we will not be incurring the overhead of the if(realType==null) check - In POCast.java, you can avoid catching ExecException and checking for errorCode == 1071. Since the getNext() call in POCast already throws ExecException, you can just let ExecExceptions from DataType.toXXX() methods bubble out. support conversion from numeric types to chararray --- Key: PIG-796 URL: https://issues.apache.org/jira/browse/PIG-796 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: 796.patch, pig-796.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
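The second review point above - checking realType only on the failure path so a successful bytearray cast pays no overhead - can be sketched like this (method and class names are illustrative, not POCast's actual code):

```java
import java.nio.charset.StandardCharsets;

class CastSketch {
    // Happy path: assume a bytearray (byte[]) and cast directly; only when
    // the cast fails do we inspect the actual runtime type, so the common
    // case carries no extra type check.
    static String toChararray(Object o) {
        try {
            return new String((byte[]) o, StandardCharsets.UTF_8);
        } catch (ClassCastException e) {
            if (o instanceof Number) {
                return o.toString(); // value was already a converted numeric
            }
            throw e; // genuinely uncastable - let the exception bubble out
        }
    }
}
```

This is the same trade-off the comment describes: optimistic cast first, diagnosis second, and no swallowing of exceptions that the caller's getNext() already declares.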
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-796: --- Status: Patch Available (was: Open) support conversion from numeric types to chararray --- Key: PIG-796 URL: https://issues.apache.org/jira/browse/PIG-796 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: 796.patch, pig-796.patch, pig-796.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-796: --- Status: Open (was: Patch Available) support conversion from numeric types to chararray --- Key: PIG-796 URL: https://issues.apache.org/jira/browse/PIG-796 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: 796.patch, pig-796.patch, pig-796.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-796: --- Status: Patch Available (was: Open) support conversion from numeric types to chararray --- Key: PIG-796 URL: https://issues.apache.org/jira/browse/PIG-796 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: 796.patch, pig-796.patch, pig-796.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-796: --- Status: Open (was: Patch Available) support conversion from numeric types to chararray --- Key: PIG-796 URL: https://issues.apache.org/jira/browse/PIG-796 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: 796.patch, pig-796.patch, pig-796.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-796: --- Resolution: Fixed Fix Version/s: 0.3.0 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Patch committed - thanks for contributing, Ashutosh! support conversion from numeric types to chararray --- Key: PIG-796 URL: https://issues.apache.org/jira/browse/PIG-796 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Fix For: 0.3.0 Attachments: 796.patch, pig-796.patch, pig-796.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
[ https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-835: --- Attachment: PIG-835.patch The root cause of the issue is that the current MultiQueryOptimizer checks whether the map key is of the same type across the different map plans it merges. If they are of different types, it ensures that the type is made tuple for all map plans - this means keys which are not tuples are wrapped in an extra tuple, while keys already of Tuple type are left alone (this is ensured in POLocalRearrange). However, the Demux operator, which passes the key and bag of values to the merged reduce plan, currently always unwraps the tuple whenever the map keys differ. This results in unwrapping of keys which were originally tuples and should not be unwrapped. The attached patch fixes this by storing an array of boolean flags in the Demux operator to indicate which map keys are wrapped and which are not, so that unwrapping occurs only where the original map key was not already a tuple and was wrapped. 
Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type) -- Key: PIG-835 URL: https://issues.apache.org/jira/browse/PIG-835 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-835.patch A query like the following results in an exception on execution: {noformat} a = load 'mult.input' as (name, age, gpa); b = group a ALL; c = foreach b generate group, COUNT(a); store c into 'foo'; d = group a by (name, gpa); e = foreach d generate flatten(group), MIN(a.age); store e into 'bar'; {noformat} Exception on execution: 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
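The per-plan flag idea in the patch description can be sketched as follows (illustrative, not PODemux's actual code; a List stands in for Tuple):

```java
import java.util.*;

class DemuxSketch {
    // One flag per merged map plan: true if that plan's key was a non-tuple
    // that got wrapped in a singleton tuple by POLocalRearrange.
    static Object unwrapIfNeeded(Object key, boolean[] keyWrapped, int planIndex) {
        if (keyWrapped[planIndex]) {
            return ((List<?>) key).get(0); // unwrap the artificial singleton
        }
        return key; // key was already a tuple originally - leave it alone
    }
}
```

In the failing query above, plan `b` (GROUP ALL) has a non-tuple key that gets wrapped, while plan `d` groups by (name, gpa) and already has a tuple key; the flags let the demux unwrap the first without corrupting the second.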
[jira] Created: (PIG-838) Parser does not handle ctrl-m ('\u000d') as argument to PigStorage
Parser does not handle ctrl-m ('\u000d') as argument to PigStorage -- Key: PIG-838 URL: https://issues.apache.org/jira/browse/PIG-838 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath A script which contains a = load 'input' using PigStorage('\u000d'); produces the following error: 2009-06-05 14:47:49,241 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 47. Encountered: \r (13), after : \' -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
[ https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-835: --- Status: Patch Available (was: Open) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type) -- Key: PIG-835 URL: https://issues.apache.org/jira/browse/PIG-835 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-835.patch A query like the following results in an exception on execution: {noformat} a = load 'mult.input' as (name, age, gpa); b = group a ALL; c = foreach b generate group, COUNT(a); store c into 'foo'; d = group a by (name, gpa); e = foreach d generate flatten(group), MIN(a.age); store e into 'bar'; {noformat} Exception on execution: 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
[ https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-835: --- Status: Open (was: Patch Available) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type) -- Key: PIG-835 URL: https://issues.apache.org/jira/browse/PIG-835 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-835.patch A query like the following results in an exception on execution: {noformat} a = load 'mult.input' as (name, age, gpa); b = group a ALL; c = foreach b generate group, COUNT(a); store c into 'foo'; d = group a by (name, gpa); e = foreach d generate flatten(group), MIN(a.age); store e into 'bar'; {noformat} Exception on execution: 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
[ https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-835: --- Attachment: PIG-835-v2.patch New patch with findbugs warnings addressed - essentially findbugs wanted the public static members in PigNullableWritable to be marked final. Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type) -- Key: PIG-835 URL: https://issues.apache.org/jira/browse/PIG-835 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-835-v2.patch, PIG-835.patch A query like the following results in an exception on execution: {noformat} a = load 'mult.input' as (name, age, gpa); b = group a ALL; c = foreach b generate group, COUNT(a); store c into 'foo'; d = group a by (name, gpa); e = foreach d generate flatten(group), MIN(a.age); store e into 'bar'; {noformat} Exception on execution: 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
[ https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-835: --- Status: Patch Available (was: Open) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type) -- Key: PIG-835 URL: https://issues.apache.org/jira/browse/PIG-835 Project: Pig Issue Type: Bug Affects Versions: 0.2.1 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.3.0 Attachments: PIG-835-v2.patch, PIG-835.patch A query like the following results in an exception on execution: {noformat} a = load 'mult.input' as (name, age, gpa); b = group a ALL; c = foreach b generate group, COUNT(a); store c into 'foo'; d = group a by (name, gpa); e = foreach d generate flatten(group), MIN(a.age); store e into 'bar'; {noformat} Exception on execution: 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-841) PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort
PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort -- Key: PIG-841 URL: https://issues.apache.org/jira/browse/PIG-841 Project: Pig Issue Type: Improvement Affects Versions: 0.2.1 Reporter: Pradeep Kamath Fix For: 0.3.0 Currently the sample map reduce job in order by implementation does the following: - sample 100 records from each map - group all on the above output - sort the output bag from the above grouping on keys of the order by - give the sorted bag to FindQuantiles udf The steps 2 and 3 above can be replaced by - group the sample output by the order by key and set parallelism of the group to 1 so that output of the group goes to one reducer. Since Hadoop ensures the output of the group is sorted by key we get sorting for free without using POSort -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
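The replacement of steps 2 and 3 relies on a standard MapReduce guarantee: the shuffle delivers map-output keys to each reducer in sorted order, so with parallelism 1 the single reducer sees all sampled keys already sorted and no POSort pass is needed before FindQuantiles. A toy model of that property (a TreeMap standing in for the sorted shuffle; this is an illustration of the guarantee, not Pig's implementation):

```java
import java.util.*;

class SampleSortSketch {
    // Model: group the sampled keys with a single reducer. The shuffle
    // (modeled by TreeMap's sorted key order) hands groups over in key
    // order, so the reducer's input is sorted without an explicit sort step.
    static List<String> keysInReduceOrder(List<String> sampledKeys) {
        TreeMap<String, Integer> shuffle = new TreeMap<>();
        for (String k : sampledKeys) shuffle.merge(k, 1, Integer::sum);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : shuffle.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) out.add(e.getKey());
        }
        return out;
    }
}
```

The design trade-off is that the sort work moves from an in-memory POSort over one large bag into the framework's merge sort during the shuffle, which also spills to disk gracefully.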
[jira] Commented: (PIG-841) PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort
[ https://issues.apache.org/jira/browse/PIG-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717835#action_12717835 ] Pradeep Kamath commented on PIG-841: This mechanism can be used for any join which requires sampling like the one described in http://wiki.apache.org/pig/PigSkewedJoinSpec PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort -- Key: PIG-841 URL: https://issues.apache.org/jira/browse/PIG-841 Project: Pig Issue Type: Improvement Affects Versions: 0.2.1 Reporter: Pradeep Kamath Fix For: 0.3.0 Currently the sample map reduce job in order by implementation does the following: - sample 100 records from each map - group all on the above output - sort the output bag from the above grouping on keys of the order by - give the sorted bag to FindQuantiles udf The steps 2 and 3 above can be replaced by - group the sample output by the order by key and set parallelism of the group to 1 so that output of the group goes to one reducer. Since Hadoop ensures the output of the group is sorted by key we get sorting for free without using POSort -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
[ https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-835:
-------------------------------

Resolution: Fixed
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

Patch committed to both trunk and branch-0.3

Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
----------------------------------------------------------------------------------------------------------------------------------------------

Key: PIG-835
URL: https://issues.apache.org/jira/browse/PIG-835
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: 0.3.0
Attachments: PIG-835-v2.patch, PIG-835.patch

A query like the following results in an exception on execution:

{noformat}
a = load 'mult.input' as (name, age, gpa);
b = group a ALL;
c = foreach b generate group, COUNT(a);
store c into 'foo';
d = group a by (name, gpa);
e = foreach d generate flatten(group), MIN(a.age);
store e into 'bar';
{noformat}

Exception on execution:

{noformat}
09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
{noformat}
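The key-type mismatch in the script above can be illustrated with a hedged sketch (the function is made up, not Pig's API): `group a ALL` produces a scalar key, while `group a by (name, gpa)` produces a tuple key, so a merged reduce pipeline that projects into the key assuming a single type fails on one of the two split plans, mirroring the ClassCastException thrown from POProject.

```python
def project_key_field(key, index):
    # analogous to POProject treating the key as a Tuple before indexing;
    # in Java this shows up as a cast of String to org.apache.pig.data.Tuple
    if not isinstance(key, tuple):
        raise TypeError(f"expected tuple key, got {type(key).__name__}")
    return key[index]

tuple_key = ("alice", 3.9)  # key shape from: group a by (name, gpa)
scalar_key = "all"          # key shape from: group a ALL

name = project_key_field(tuple_key, 0)   # fine for the tuple-keyed plan
try:
    project_key_field(scalar_key, 0)     # the scalar-keyed plan blows up
    mismatch_detected = False
except TypeError:
    mismatch_detected = True
```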
[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan
[ https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-846:
-------------------------------

Attachment: PIG-846.patch

MultiQuery optimization in some cases has an issue when there is a split in the map plan
----------------------------------------------------------------------------------------

Key: PIG-846
URL: https://issues.apache.org/jira/browse/PIG-846
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: 0.3.0
Attachments: PIG-846.patch

The following script produces the error that follows:

{noformat}
A = LOAD 'input.txt' as (f0, f1, f2, f3, f4, f5, f6, f7, f8);
B = FOREACH A GENERATE f0, f1, f2, f3, f4;
B1 = foreach B generate f0, f1, f2;
C = GROUP B1 BY (f1, f2);
STORE C into 'foo1';
B2 = FOREACH B GENERATE f0, f3, f4;
E = GROUP B2 BY (f3, f4);
STORE E into 'foo2';
F = FOREACH A GENERATE f0, f5, f6, f7, f8;
F1 = FOREACH F GENERATE f0, f5, f6;
G = GROUP F1 BY (f5, f6);
STORE G into 'foo3';
F2 = FOREACH F GENERATE f0, f7, f8;
I = GROUP F2 BY (f7, f8);
STORE I into 'foo4';
{noformat}

Exception encountered during execution:

{noformat}
java.lang.NullPointerException
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:262)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:209)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:277)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
{noformat}
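The shape of this failure can be sketched with a purely illustrative model (not Pig's actual data structures): POPackage rebuilds each value tuple using a per-input index carried on the record, and if a map-side split emits records whose index was never registered for the package handling them, the lookup yields null and the subsequent use of it crashes, loosely analogous to the NullPointerException in POPackage.getValueTuple.

```python
# hypothetical index-to-field-list table; real Pig keeps richer per-input state
index_to_fields = {0: ("f1", "f2"), 1: ("f3", "f4")}

def get_value_tuple(index, values):
    fields = index_to_fields.get(index)  # None for an unregistered index
    return dict(zip(fields, values))     # zip(None, ...) raises TypeError

ok = get_value_tuple(0, ("x", "y"))      # registered index: value rebuilt
try:
    get_value_tuple(2, ("x", "y"))       # unregistered index: null lookup
    crashed = False
except TypeError:
    crashed = True                       # stands in for the Java NPE
```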
[jira] Created: (PIG-847) Setting twoLevelAccessRequired field in a bag schema should not be required to access fields in the tuples of the bag
Setting twoLevelAccessRequired field in a bag schema should not be required to access fields in the tuples of the bag
---------------------------------------------------------------------------------------------------------------------

Key: PIG-847
URL: https://issues.apache.org/jira/browse/PIG-847
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath

Currently Pig interprets the result type of a relation as a bag, yet the schema of the relation directly contains the schema describing the fields in the tuples of the relation. However, when a udf wants to return a bag, or there is a bag in the input data, or the user creates a bag constant, the schema of the bag has one field schema which is that of the tuple, and the tuple's schema has the types of the fields. To access the fields of such a bag directly (using something like bagname.fieldname or bagname.fieldposition), the schema of the bag must have twoLevelAccessRequired set to true so that Pig's type system can traverse the tuple schema and get to the field in question. This is confusing - we should try and see if we can avoid needing this extra flag. One possible solution is to treat bags the same way whether they represent relations or real bags. Another is to introduce a special relation datatype for the result type of a relation, with the bag type used only for true bags; in that case the bag schema would always have a tuple schema describing the fields.
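The two schema shapes described above can be modelled with a small sketch (dict-based and hypothetical, not Pig's actual Schema API): a relation's schema lists field schemas directly, while a true bag's schema wraps them in a single tuple-level schema, which is the extra level the twoLevelAccessRequired flag tells the type system to descend through.

```python
# relation-style schema: fields listed directly
relation_schema = {"fields": [("name", "chararray"), ("gpa", "double")]}

# true-bag-style schema: one tuple field wrapping the real fields
bag_schema = {
    "two_level_access": True,
    "fields": [
        ("t", {"fields": [("name", "chararray"), ("gpa", "double")]})
    ],
}

def resolve_field(schema, field):
    fields = schema["fields"]
    if schema.get("two_level_access"):
        # descend through the single tuple schema before looking up the field
        fields = fields[0][1]["fields"]
    return dict(fields)[field]

print(resolve_field(relation_schema, "gpa"))  # resolved without the flag
print(resolve_field(bag_schema, "gpa"))       # resolvable only via the flag
```

Without the flag, `bagname.gpa` against `bag_schema` would search the outer level, find only the tuple field, and fail, which is the asymmetry the issue proposes removing.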
[jira] Created: (PIG-848) Explain output sometimes may not match the exact plan that is executed in terms of the order in which inner plans and operators are presented - (semantically the plans are the same)
Explain output sometimes may not match the exact plan that is executed in terms of the order in which inner plans and operators are presented - (semantically the plans are the same)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Key: PIG-848
URL: https://issues.apache.org/jira/browse/PIG-848
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath

The visitors used for explain and in the MRCompiler do not guarantee order - hence the plan shown in explain output may not match the plan that is finally executed. This is not a bug, but it makes debugging harder. Even if the executed plan differs from the one in explain, it is still the same in terms of semantics - the difference is only in the order of inner plans and operators. It would be nice to have an order-preserving way of producing explain output which would also be used to construct the plan (MRPlan) that is finally executed.
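The determinism issue can be illustrated with a toy traversal (hypothetical, not Pig's visitor code): iterating an unordered container of successor plans can render them in different orders across runs, while an insertion-ordered walk always reproduces the same explain text for the same plan.

```python
def explain(successor_plans):
    # render a traversal of successor plans as one explain-style line
    return " -> ".join(successor_plans)

# insertion-ordered traversal: same plan, same output, every run
ordered_walk = explain(["Split", "LocalRearrange", "Package"])

# unordered traversal: Python sets make no iteration-order promise (string
# hashing is randomized per process), so the rendered order can differ
# between runs even though the members are identical
unordered_walk = explain({"Split", "LocalRearrange", "Package"})
```

The proposal in the issue amounts to using the ordered variant for both the explain rendering and the construction of the executed MRPlan, so the two can never diverge.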
[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan
[ https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-846:
-------------------------------

Status: Open (was: Patch Available)

Will be resubmitting a new patch - just realized that a few unit tests are broken
[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan
[ https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-846:
-------------------------------

Attachment: PIG-846-v2.patch

New patch - the only change is to not add extra information in POLocalRearrange.name(); that was in the earlier patch only to add more information in explain output, but it breaks some unit tests. The TestHBaseStorage unit test still fails for me, but the failure is not related to the changes in the patch - I am assuming that is an environment issue on my machine.
[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)
[ https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-835:
-------------------------------

Attachment: PIG-846-v2.patch

New patch - the only change is to not add extra information in POLocalRearrange.name(); that was in the earlier patch only to add more information in explain output, but it breaks some unit tests. The TestHBaseStorage unit test still fails for me, but the failure is not related to the changes in the patch - I am assuming that is an environment issue on my machine.
[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan
[ https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-846:
-------------------------------

Status: Patch Available (was: Open)