[jira] Updated: (PIG-537) Failure in Hadoop map collect stage due to type mismatch in the keys used in cogroup

2008-11-20 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-537:
---

Status: Patch Available  (was: Open)

The issue was in the ImplicitSplitInserter. In this query, the same load provides 
input to two cogroups, so an implicit split needs to be introduced. However, 
the ImplicitSplitInserter was changing the order of the inputs to the first 
cogroup as it rewired the plan with the new Split and SplitOutput operators. 
The patch changes the algorithm for introducing these new operators so that 
the order of the inputs to the load's successors is maintained.

 Failure in Hadoop map collect stage due to type mismatch in the keys used in 
 cogroup
 

 Key: PIG-537
 URL: https://issues.apache.org/jira/browse/PIG-537
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
Priority: Critical
 Fix For: types_branch

 Attachments: explain_aliasC.log, mygrades.txt, mymarks.txt


 Consider the following pig query, which demonstrates various problems during 
 the Logical Plan creation and the subsequent execution of the M/R job. In 
 this query we do two cogroups: one between A and B, to generate an alias 
 ABtemptable; then we again cogroup A with ABtemptable based on marks, which 
 was read in as an int. 
 ==
 {code}
 A = load 'mymarks.txt' as (marks:int, username:chararray);
 B = load 'mygrades.txt' as (username:chararray,grade:chararray);
 ABtemp = cogroup A by username, B by username;
 ABtemptable = foreach ABtemp generate
     group as username,
     flatten(A.marks) as newmarks;
 --describe ABtemptable;
 C = cogroup A by marks, ABtemptable by newmarks;
 --describe C;
 explain C;
 dump C;
 {code}
 ==
 The schema for ABtemptable and C which pig reports:
 ==
 {code}
 describe ABtemptable;
 ABtemptable: {username: chararray, newmarks: int}
 describe C;
 C: {group: int, A: {username: chararray, marks: int}, ABtemptable: {username: chararray, newmarks: int}}
 {code}
 ==
 If you run the above query you get the following error:
 ==
 2008-11-18 03:57:14,372 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error 
 message from task (map) task_200810152105_0156_m_00java.io.IOException: 
 Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, 
 recieved org.apache.pig.impl.io.NullableIntWritable
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:97)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:172)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:158)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:82)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 ==
 Looking at the {code}explain C;{code} output, you see that newmarks has 
 become a chararray (surprising!!)
 ==
 ---CoGroup viraj-Tue Nov 18 03:49:42 UTC 2008-25 Schema: {group: Unknown,{username: bytearray,marks: int},ABtemptable: {username: chararray,newmarks: chararray}} Type: bag
     Project viraj-Tue Nov 18 03:49:42 UTC 2008-23 Projections: [1] Overloaded: false FieldSchema: marks: int Type: int
     Input: SplitOutput[null] viraj-Tue Nov 18 03:49:42 UTC 2008-29
     Project viraj-Tue Nov 18 03:49:42 UTC 2008-24 Projections: [1] Overloaded: false FieldSchema: newmarks: chararray Type: chararray
     Input: ForEach viraj-Tue Nov 18 03:49:42 UTC 2008-22
 ---ForEach viraj-Tue Nov 18 03:49:42 UTC 2008-22 Schema: {username: chararray,newmarks: chararray} Type: bag
 ==
 In summary, this script demonstrates the following problems:
 1) Logical Plan 

[jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query

2008-12-17 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-563:
---

Attachment: PIG-563.patch

 PERFORMANCE: enable combiner to be called 0 or more times whenever the 
 combiner is used for a pig query
 --

 Key: PIG-563
 URL: https://issues.apache.org/jira/browse/PIG-563
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-563.patch


 Currently Pig's use of the combiner assumes the combiner is called exactly 
 once in Hadoop. With Hadoop 18, the combiner can be called 0, 1 or more 
 times. This issue is to track changes needed in the CombinerOptimizer visitor 
 and the built-in Algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to be able to 
 work in this new model.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-573) Changes to make Pig run with Hadoop 19

2008-12-19 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-573:
---

Attachment: hadoop19.jar
PIG-573-combinerflag.patch
PIG-573.patch

PIG-573.patch contains the main changes for Hadoop 19. Essentially the changes 
were to reflect the renaming of org.apache.hadoop.dfs to 
org.apache.hadoop.hdfs, to use the correct API in place of deprecated ones, 
and to change build.xml to use hadoop19.jar.

Caveats:
1) http://issues.apache.org/jira/browse/PIG-563 has a change in 
JobControlCompiler to remove the use of jobConf.setCombineOnceOnly(true). So 
if PIG-573.patch needs to be used BEFORE PIG-563 is committed, 
PIG-573-combinerflag.patch should also be used to get this change.
2) PIG-573 depends on a hadoop19.jar being present in lib dir. I have attached 
a hadoop19.jar for use.
3) Changes are needed in the combiner to work with this patch - those changes 
are in http://issues.apache.org/jira/browse/PIG-563

 Changes to make Pig run with Hadoop 19
 --

 Key: PIG-573
 URL: https://issues.apache.org/jira/browse/PIG-573
 Project: Pig
  Issue Type: Task
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: hadoop19.jar, PIG-573-combinerflag.patch, PIG-573.patch


 This issue tracks changes to Pig code to make it work with Hadoop-0.19.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-573) Changes to make Pig run with Hadoop 19

2008-12-22 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12658662#action_12658662
 ] 

Pradeep Kamath commented on PIG-573:


The HBase code in Pig doesn't work with Hadoop 19, and the files submitted thus 
far do not address this. An hbase-0.19.0.jar and hbase-0.19.0-test.jar are 
required, and possibly other changes as well.

 Changes to make Pig run with Hadoop 19
 --

 Key: PIG-573
 URL: https://issues.apache.org/jira/browse/PIG-573
 Project: Pig
  Issue Type: Task
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: hadoop19.jar, PIG-573-combinerflag.patch, PIG-573.patch


 This issue tracks changes to Pig code to make it work with Hadoop-0.19.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-563) PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner is used for a pig query

2008-12-22 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-563:
---

Attachment: PIG-563-v3.patch

COUNT.Initial was implemented that way so that if it were called in the 
non-combiner case in the reduce, it would still produce the right result. 
However, since we currently plan to call COUNT.Initial only when the combine 
plan is also present, we are guaranteed it is called only in the map. So, in 
the new version in the attachment, I have changed it to emit 1 as suggested in 
the review comment.
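
To make the guarantee concrete, here is a minimal, self-contained sketch of the 
combiner-safe pattern being described (illustrative only, not the actual 
PIG-563 patch; the class skeleton and helper names are assumptions, while 
EvalFunc and the Initial/Intermed/Final convention are Pig's):
{code}
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CombinerSafeCount {
    private static final TupleFactory mTupleFactory = TupleFactory.getInstance();

    // Map side: sees exactly one input tuple, so it just emits a count of 1.
    public static class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return mTupleFactory.newTuple(Long.valueOf(1));
        }
    }

    // Combine side: may run 0, 1 or more times; summing partial counts is
    // associative, so repeated application still yields correct partials.
    public static class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return mTupleFactory.newTuple(sumCounts(input));
        }
    }

    // Reduce side: sums whatever partial counts survived the combine phase.
    public static class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException {
            return sumCounts(input);
        }
    }

    static Long sumCounts(Tuple input) throws IOException {
        DataBag values = (DataBag) input.get(0);
        long total = 0;
        for (Tuple t : values) {
            total += (Long) t.get(0);
        }
        return Long.valueOf(total);
    }
}
{code}
The key property is that Intermed's summing step is associative, so Hadoop may 
apply the combiner 0, 1 or more times without changing the final count.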

 PERFORMANCE: enable combiner to be called 0 or more times whenever the 
 combiner is used for a pig query
 --

 Key: PIG-563
 URL: https://issues.apache.org/jira/browse/PIG-563
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-563-v2.patch, PIG-563-v3.patch, PIG-563.patch


 Currently Pig's use of the combiner assumes the combiner is called exactly 
 once in Hadoop. With Hadoop 18, the combiner can be called 0, 1 or more 
 times. This issue is to track changes needed in the CombinerOptimizer visitor 
 and the built-in Algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to be able to 
 work in this new model.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach

2008-12-31 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-580:
---

Attachment: PIG-580-v2.patch

Attaching a new version (PIG-580-v2.patch) - the only difference from the 
earlier one is that I have removed a debug statement from 
test/org/apache/pig/test/Util.java.

 PERFORMANCE: Combiner should also be used when there are distinct aggregates 
 in a foreach following a group provided there are no non-algebraics in the 
 foreach 
 

 Key: PIG-580
 URL: https://issues.apache.org/jira/browse/PIG-580
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-580-v2.patch, PIG-580.patch


 Currently Pig uses the combiner only when there is a foreach following a group 
 and the elements in the foreach's generate have the following characteristics:
 1) simple project of the group column
 2) Algebraic UDF
 The above conditions exclude use of the combiner for distinct aggregates - 
 the distinct operation itself is combinable (irrespective of whether it feeds 
 an algebraic or non-algebraic udf). So the following foreach should also be 
 combinable:
 {code}
 ..
 b = group a by $0;
 c = foreach b {
     x = distinct a;
     generate group, COUNT(x), SUM(x.$1);
 };
 {code}
 The combiner optimizer should cause the distinct to be combined and the final 
 combine output should feed the COUNT() and SUM() in the reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-581) Pig should enable an option to disable the use of combiner optimizer

2008-12-31 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-581:
---

Summary: Pig should enable an option to disable the use of combiner 
optimizer  (was: Pig should enable an option to disable the use of optimizer)

 Pig should enable an option to disable the use of combiner optimizer
 

 Key: PIG-581
 URL: https://issues.apache.org/jira/browse/PIG-581
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
 Fix For: types_branch


 There are some cases where a combiner optimization chosen by Pig may actually 
 be slower than the non-optimized version. For example, the use of the combiner 
 to address the issue reported in https://issues.apache.org/jira/browse/PIG-580 
 could result in slower execution IF the distinct on groups of values does not 
 actually shrink those groups. This is, however, very data dependent; the user 
 may know beforehand that this is the case and may wish to disable the use of 
 the optimizer. Pig should provide an option to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-554) Fragment Replicate Join

2009-01-07 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-554:
---

Attachment: PIG-554-v4.patch

Changes in new patch (attached):
1) The HashMap now maps (Tuple, List<Tuple>) to address the concern that a Bag 
would be worse space-wise than a List<Tuple>. BagFactory now has a method 
newDefaultBag(List<Tuple>) which creates a DefaultDataBag out of the 
List<Tuple> by taking ownership of the list, without copying the elements. 
This way, in POFRJoin.getNext() we can create a bag out of the List<Tuple> 
without much overhead (see the sketch below).
2) Added back the unit test - TestFRJoin - the change in this file is to use 
Util.createInputFile() to create the input file for the tests on the 
minicluster DFS rather than the local file system. It uses Util.deleteFile() 
to delete the file after each test run.
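
As an illustration of point 1, here is a minimal sketch of the lookup-side idea 
(BagFactory.newDefaultBag(List<Tuple>) is the method this patch adds and 
POFRJoin.getNext() the method it modifies; the surrounding class and method 
names are assumptions for illustration):
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class ReplicatedLookupSketch {
    // key -> all replicated-side tuples with that key
    private final Map<Tuple, List<Tuple>> replicated =
            new HashMap<Tuple, List<Tuple>>();

    void add(Tuple key, Tuple value) {
        List<Tuple> matches = replicated.get(key);
        if (matches == null) {
            matches = new ArrayList<Tuple>();
            replicated.put(key, matches);
        }
        matches.add(value);
    }

    // Wrap the stored list in a bag without copying: the bag takes
    // ownership of the list, as described for POFRJoin.getNext().
    DataBag lookup(Tuple key) {
        List<Tuple> matches = replicated.get(key);
        return matches == null ? null
                : BagFactory.getInstance().newDefaultBag(matches);
    }
}
{code}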

 Fragment Replicate Join
 ---

 Key: PIG-554
 URL: https://issues.apache.org/jira/browse/PIG-554
 Project: Pig
  Issue Type: New Feature
Affects Versions: types_branch
Reporter: Shravan Matthur Narayanamurthy
Assignee: Shravan Matthur Narayanamurthy
 Fix For: types_branch

 Attachments: frjofflat.patch, frjofflat1.patch, PIG-554-v3.patch, 
 PIG-554-v4.patch


 Fragment Replicate Join (FRJ) is useful when we want a join between a huge 
 table and a very small table (small enough to fit in memory) and the join 
 doesn't expand the data by much. The idea is to distribute the processing of 
 the huge file by fragmenting it and replicating the small file to all 
 machines receiving a fragment of the huge file. Because the entire small file 
 is available, the join becomes a trivial task without needing any break in 
 the pipeline. Exhaustive tests have been done to determine the improvement we 
 get out of FRJ. Here are the details: http://wiki.apache.org/pig/PigFRJoin
 The patch makes changes to parts of the code where new operators are 
 introduced. Currently, when a new operator is introduced, its alias is not 
 set. For schema computation I have modified this behaviour to set the alias 
 of the new operator to that of its predecessor. The logical side of the patch 
 mimics the cogroup behavior, as the join syntax closely resembles that of 
 cogroup. Currently, this patch doesn't support joins other than inner joins. 
 The rest of the code has been documented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-628) PERFORMANCE: Misc. optimizations including optimization in Tuple serialization, set up of PigMapReduce and PigCombiner, accessing index in POLocalRearrange

2009-01-20 Thread Pradeep Kamath (JIRA)
PERFORMANCE: Misc. optimizations including optimization in Tuple serialization, 
set up of PigMapReduce and PigCombiner, accessing index in POLocalRearrange
-

 Key: PIG-628
 URL: https://issues.apache.org/jira/browse/PIG-628
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch


- Currently DefaultTuple.write() needlessly writes a null/not-null marker. 
This is already handled by PigNullableWritable for keys and NullableTuple for 
values. Nested null tuples inside a tuple are written out as nulls in 
DataReaderWriter.writeDatum. So the null/not-null marker in DefaultTuple can 
be avoided.

- In PigMapReduce and PigCombiner the roots and leaves of the plans are 
calculated in each reduce() call. Instead, these can be computed once in 
configure().

- In each call of POLocalRearrange.getNext(), a new lrOutput tuple is created 
whose first position is filled with the index, second with the key and third 
with the value. This can be optimized by having a tuple member in 
POLocalRearrange which is reused in each getNext() call. Further, the first 
position of this tuple can be pre-filled with the index in the setIndex() 
method of POLocalRearrange at script compile time (see the sketch after this 
list).

- In POCombinerPackage, the metadata data structures that indicate which parts 
of the value are present in the key can be set up in the setKeyInfo() method 
at compile time. This is because we currently use POCombinerPackage only with 
a group by, so we don't need to look up the metadata at run time based on the 
input index, since there will be only one input (index = 0).
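
As a concrete illustration of the tuple-reuse idea in the third bullet, here is 
a minimal sketch (setIndex() and getNext() are the real POLocalRearrange 
methods; the lrOutput member and the attachKeyValue() helper are illustrative 
assumptions):
{code}
import java.io.IOException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class LocalRearrangeSketch {
    private static final TupleFactory mTupleFactory = TupleFactory.getInstance();

    // Reused on every call instead of allocating a fresh three-slot tuple.
    private final Tuple lrOutput = mTupleFactory.newTuple(3);

    // Called once, at script compile time: slot 0 never changes afterwards.
    public void setIndex(byte index) throws IOException {
        lrOutput.set(0, Byte.valueOf(index));
    }

    // Per-record work shrinks to two set() calls on the reused tuple.
    public Tuple attachKeyValue(Object key, Tuple value) throws IOException {
        lrOutput.set(1, key);
        lrOutput.set(2, value);
        return lrOutput;
    }
}
{code}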

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-628) PERFORMANCE: Misc. optimizations including optimization in Tuple serialization, set up of PigMapReduce and PigCombiner, accessing index in POLocalRearrange

2009-01-20 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-628:
---

Attachment: PIG-628.patch

Attached patch which implements the changes described in the issue description.

 PERFORMANCE: Misc. optimizations including optimization in Tuple 
 serialization, set up of PigMapReduce and PigCombiner, accessing index in 
 POLocalRearrange
 -

 Key: PIG-628
 URL: https://issues.apache.org/jira/browse/PIG-628
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-628.patch


 - Currently DefaultTuple.write() needlessly writes a null/not-null marker. 
 This is already handled by PigNullableWritable for keys and NullableTuple 
 for values. Nested null tuples inside a tuple are written out as nulls in 
 DataReaderWriter.writeDatum. So the null/not-null marker in DefaultTuple can 
 be avoided.
 - In PigMapReduce and PigCombiner the roots and leaves of the plans are 
 calculated in each reduce() call. Instead, these can be computed once in 
 configure().
 - In each call of POLocalRearrange.getNext(), a new lrOutput tuple is created 
 whose first position is filled with the index, second with the key and third 
 with the value. This can be optimized by having a tuple member in 
 POLocalRearrange which is reused in each getNext() call. Further, the first 
 position of this tuple can be pre-filled with the index in the setIndex() 
 method of POLocalRearrange at script compile time.
 - In POCombinerPackage, the metadata data structures that indicate which 
 parts of the value are present in the key can be set up in the setKeyInfo() 
 method at compile time. This is because we currently use POCombinerPackage 
 only with a group by, so we don't need to look up the metadata at run time 
 based on the input index, since there will be only one input (index = 0).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-634) When POUnion is one of the roots of a map plan, POUnion.getNext() gives a null pointer exception

2009-01-26 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-634:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 When POUnion is one of the roots of a map plan, POUnion.getNext() gives a 
 null pointer exception
 

 Key: PIG-634
 URL: https://issues.apache.org/jira/browse/PIG-634
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-634.patch


 POUnion.getnext() gives a null pointer exception in the following scenario 
 (pasted from a code comment explaining the fix for this issue). If a script 
 results in a plan like the one below, currently POUnion.getNext() gives a 
 null pointer exception
 {noformat}
 // POUnion
 // |
 // |--POLocalRearrange
 // |    |
 // |    |--POUnion (root 2) -- This union's getNext() can lead the code here
 // |
 // |--POLocalRearrange (root 1)

 // The inner POUnion above is a root in the plan which has 2 roots.
 // So these 2 roots would have input coming from different input
 // sources (dfs files). So certain maps would be working on input only
 // meant for root 1 above and some maps would work on input
 // meant only for root 2. In the former case, root 2 would
 // neither get input attached to it nor does it have predecessors.
 {noformat}
 A script which can cause a plan like above is:
 {code}
 a = load 'xyz'; 
 b = load 'abc'; 
 c = union a,b; 
 d = load 'def'; 
 e = cogroup c by $0 inner , d by $0 inner;
 dump e;
 {code}
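 The committed fix itself is not quoted in this thread. As a rough sketch only 
 - the isInputAttached()/getInputs() helpers and the Result plumbing are 
 assumptions about Pig's PhysicalOperator internals, not the actual patch - 
 the guard in POUnion.getNext() might look like:
 {code}
 // Rough sketch, not the committed patch: a root POUnion that has no tuple
 // attached and no predecessors is running in a map whose input was meant
 // for the other root, so it should signal end-of-processing instead of
 // dereferencing a null input.
 if (!isInputAttached() && (getInputs() == null || getInputs().isEmpty())) {
     Result res = new Result();
     res.returnStatus = POStatus.STATUS_EOP;
     return res;
 }
 {code}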

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-634) When POUnion is one of the roots of a map plan, POUnion.getNext() gives a null pointer exception

2009-01-26 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12667439#action_12667439
 ] 

Pradeep Kamath commented on PIG-634:


Patch committed

 When POUnion is one of the roots of a map plan, POUnion.getNext() gives a 
 null pointer exception
 

 Key: PIG-634
 URL: https://issues.apache.org/jira/browse/PIG-634
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-634.patch


 POUnion.getnext() gives a null pointer exception in the following scenario 
 (pasted from a code comment explaining the fix for this issue). If a script 
 results in a plan like the one below, currently POUnion.getNext() gives a 
 null pointer exception
 {noformat}
 // POUnion
 // |
 // |--POLocalRearrange
 // |    |
 // |    |--POUnion (root 2) -- This union's getNext() can lead the code here
 // |
 // |--POLocalRearrange (root 1)

 // The inner POUnion above is a root in the plan which has 2 roots.
 // So these 2 roots would have input coming from different input
 // sources (dfs files). So certain maps would be working on input only
 // meant for root 1 above and some maps would work on input
 // meant only for root 2. In the former case, root 2 would
 // neither get input attached to it nor does it have predecessors.
 {noformat}
 A script which can cause a plan like above is:
 {code}
 a = load 'xyz'; 
 b = load 'abc'; 
 c = union a,b; 
 d = load 'def'; 
 e = cogroup c by $0 inner , d by $0 inner;
 dump e;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-636) PERFORMANCE: Use lightweight bag implementations which do not register with SpillableMemoryManager with Combiner

2009-01-26 Thread Pradeep Kamath (JIRA)
PERFORMANCE: Use lightweight bag implementations which do not register with 
SpillableMemoryManager with Combiner


 Key: PIG-636
 URL: https://issues.apache.org/jira/browse/PIG-636
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch


Currently, whenever the combiner is used in Pig, in the map the 
POPrecombinerLocalRearrange operator puts the single value tuple 
corresponding to a key into a DataBag and passes this to the foreach which is 
being combined. This generates as many bags as there are input records. 
These bags all have a single tuple and hence are small and should not need to 
be spilled to disk. However, since the bags are created through the BagFactory 
mechanism, each bag creation is registered with the SpillableMemoryManager and 
a weak reference to the bag is stored in a linked list. This linked list grows 
really big over time, causing unnecessary garbage collection runs. This can be 
avoided by having a simple lightweight implementation of the DataBag interface 
to store the single tuple in a bag; these SingleTupleBags should be created 
without registering with the SpillableMemoryManager. Likewise, the bags 
created in POCombinePackage are supposed to fit in memory and not spill. Again, 
a NonSpillableDataBag implementation of the DataBag interface which does not 
register with the SpillableMemoryManager would help.
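
A minimal sketch of the SingleTupleBag idea (the real DataBag interface has 
many more methods; only the essentials are shown, and the point is that 
construction is a plain field assignment with no SpillableMemoryManager 
registration):
{code}
import java.util.Collections;
import java.util.Iterator;
import org.apache.pig.data.Tuple;

public class SingleTupleBagSketch implements Iterable<Tuple> {
    private final Tuple t;

    // No factory, no weak reference, no SpillableMemoryManager registration.
    public SingleTupleBagSketch(Tuple t) {
        this.t = t;
    }

    public long size() {
        return 1;
    }

    public Iterator<Tuple> iterator() {
        return Collections.singletonList(t).iterator();
    }
}
{code}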


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-645) Streaming is broken with the latest trunk

2009-01-29 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-645:
---

Attachment: PIG-645.patch

 Streaming is broken with the latest trunk
 -

 Key: PIG-645
 URL: https://issues.apache.org/jira/browse/PIG-645
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-645.patch


 Several tests we run are failing now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-645) Streaming is broken with the latest trunk

2009-01-29 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-645:
---

Fix Version/s: types_branch
Affects Version/s: types_branch
   Status: Patch Available  (was: Open)

Attached patch to fix the issues with streaming. The root cause was that the 
changes introduced by PIG-629 (PERFORMANCE: Eliminate use of TargetedTuple for 
each input tuple in the map()) created a race condition on the input tuple in 
the map() between recordReader.next(Tuple value) and the streaming binary.

BEFORE PIG-629, the flow of a tuple from record reader to map was as follows:
The recordReader instance gets the *same* TargetedTuple object reference in 
every next(TargetedTuple value) call (this is because Hadoop reuses the value 
object for each recordReader.next(value) call). The recordReader.next(value) 
call in turn calls PigSlice.next(Tuple value), which has the following 
implementation:
{code}
public boolean next(Tuple value) throws IOException {
    Tuple t = loader.getNext();
    if (t == null) {
        return false;
    }
    value.reference(t);
    return true;
}
{code}
Here value.reference(t) calls the TargetedTuple.reference(Tuple) method, which 
simply stores the supplied tuple in its member Tuple variable t.

In PigMapBase.map(), the toTuple() method on the input TargetedTuple is 
called, which returns the stored tuple reference t from above. This reference 
is then attached to the roots of the map plan.

The point to note is that the final tuple reference used by the operators in 
the map plan is the reference to the tuple returned from the loader, and not 
the reference to the TargetedTuple which we get from the recordReader and 
which is supplied as an argument to the map() call. The loader creates a new 
tuple reference on each getNext(). This guarantees that the operators in the 
map plan always work with a different tuple reference on each map() call, even 
though the TargetedTuple reference supplied in the map() is the same and is 
reused by Hadoop.

AFTER PIG-629, the flow changed as follows:
TargetedTuple was removed and Tuple was used instead. The PigSlice.next(Tuple 
value) code remained intact. However, the DefaultTuple.reference(Tuple) call 
in it assigns the internal mFields arraylist to the arraylist of the supplied 
tuple. Note that here the internal member arraylist of the DefaultTuple is 
changed to refer to the internal arraylist of the Tuple the loader gives.
In map(), the tuple which is supplied as the input argument to the map() call 
is attached directly to the roots. So in the case of streaming, this tuple is 
finally supplied to the binary by using a storage function (PigStorage by 
default). However, this tuple reference is the same as the one which gets 
reused by Hadoop in the next recordReader.next(value) call. So while the 
storage function is in the process of writing the current Tuple's contents 
(the mFields arraylist), they can get changed underneath by a 
recordReader.next(value) call. So unless the storage function writes to the 
binary's stdin BEFORE the next recordReader.next(value) call, the input sent 
to the binary will be garbled.

The fix is the following one line change:
{noformat}
 for (PhysicalOperator root : roots) {
-    root.attachInput(inpTuple);
+    root.attachInput(tf.newTupleNoCopy(inpTuple.getAll()));
 }
{noformat}

In map(), instead of attaching inpTuple directly to the roots of the plan, a 
new Tuple is created which refers to the same mFields arraylist as inpTuple. 
With this change, all operators in the map plan now work on a Tuple reference 
different from the one supplied in the map() argument (and reused by Hadoop). 
This reference will refer to the mFields of the Tuple returned from the 
loader, which is guaranteed to be a new arraylist for each input tuple, since 
the loader creates a new Tuple each time. 

 Streaming is broken with the latest trunk
 -

 Key: PIG-645
 URL: https://issues.apache.org/jira/browse/PIG-645
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-645.patch


 Several tests we run are failing now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()

2009-01-30 Thread Pradeep Kamath (JIRA)
RandomSampleLoader does not handle skipping correctly in getNext()
--

 Key: PIG-649
 URL: https://issues.apache.org/jira/browse/PIG-649
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch


Currently RandomSampleLoader calls skip() on the underlying input stream 
(BufferedPositionedInputStream) in its getNext(). The input stream may not 
actually skip over the amount the RandomSampleLoader needs in one call. 
RandomSampleLoader should check the return value from the skip() call and 
ensure that skip() is called repeatedly (if necessary) till the needed number 
of bytes are skipped.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-651) PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens

2009-01-30 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-651:
---

Status: Patch Available  (was: Open)

Attached patch implements a simpler POForEachNoFlatten, used whenever no 
flattens are present in the POForEach (this is determined in LogToPhyTranslator 
and is also used in CombinerOptimizer for the map and combine plan foreachs). 
Initial tests show a marginal speedup (10 seconds out of 9 mins 50 secs in one 
particular group-by query). 

 PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach 
 has no flattens
 ---

 Key: PIG-651
 URL: https://issues.apache.org/jira/browse/PIG-651
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-651.patch


 POForEach has a lot of code to handle flattening (cross product) of the 
 fields in the generate. This is relevant only when at least one field in the 
 generate needs to be flattened. If no field in the generate needs to be 
 flattened, a more simplified and hopefully more efficient POForEach can be 
 used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()

2009-01-30 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-649:
---

Status: Patch Available  (was: Open)

The attached patch fixes the issue by keeping track of the return value of the 
underlying input stream's skip(). If enough bytes are not skipped on the 
initial call, skip() is called repeatedly until enough bytes are skipped, as 
sketched below.
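
The fix described above amounts to the standard skip-fully loop; a generic 
sketch (not the patch itself):
{code}
import java.io.IOException;
import java.io.InputStream;

public final class SkipFully {
    // InputStream.skip() may skip fewer bytes than requested, so keep
    // calling it until the full amount is skipped or the stream ends.
    static void skipFully(InputStream in, long toSkip) throws IOException {
        long remaining = toSkip;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                break;  // end of stream, or the stream cannot skip further
            }
            remaining -= skipped;
        }
    }
}
{code}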

 RandomSampleLoader does not handle skipping correctly in getNext()
 --

 Key: PIG-649
 URL: https://issues.apache.org/jira/browse/PIG-649
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-649.patch


 Currently RandomSampleLoader calls skip() on the underlying input stream 
 (BufferedPositionedInputStream) in its getNext(). The input stream may not 
 actually skip over the amount the RandomSampleLoader needs in one call. 
 RandomSampleLoader should check the return value from the skip() call and 
 ensure that skip() is called repeatedly (if necessary) till the needed number 
 of bytes are skipped.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()

2009-01-30 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-649:
---

Attachment: PIG-649.patch

 RandomSampleLoader does not handle skipping correctly in getNext()
 --

 Key: PIG-649
 URL: https://issues.apache.org/jira/browse/PIG-649
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-649.patch


 Currently RandomSampleLoader calls skip() on the underlying input stream 
 (BufferedPositionedInputStream) in its getNext(). The input stream may not 
 actually skip over the amount the RandomSampleLoader needs in one call. 
 RandomSampleLoader should check the return value from the skip() call and 
 ensure that skip() is called repeatedly (if necessary) till the needed number 
 of bytes are skipped.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-648) BinStorage fails when it finds markers unexpectedly in the data

2009-01-30 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-648:
---

Status: Patch Available  (was: Open)

Attached patch which now uses ctrl-A, ctrl-B, ctrl-C (0x01, 0x02, 0x03) as the 
record markers.
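
For illustration, here is a sketch of how a reader can resynchronize on the 
new three-byte marker (this is not BinStorage's actual scanning code; the 
class and method names are assumptions):
{code}
import java.io.DataInputStream;
import java.io.IOException;

public final class RecordMarkerSketch {
    static final byte[] MARKER = { 0x01, 0x02, 0x03 };  // ctrl-A, ctrl-B, ctrl-C

    // Reads byte by byte until the three-byte sequence is seen;
    // returns false at end of stream.
    static boolean seekToNextRecord(DataInputStream in) throws IOException {
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            if ((byte) b == MARKER[matched]) {
                if (++matched == MARKER.length) {
                    return true;  // positioned just past a record marker
                }
            } else {
                // restart the match; re-check this byte against MARKER[0]
                matched = ((byte) b == MARKER[0]) ? 1 : 0;
            }
        }
        return false;
    }
}
{code}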

 BinStorage fails when it finds markers unexpectedly in the data
 ---

 Key: PIG-648
 URL: https://issues.apache.org/jira/browse/PIG-648
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-648.patch


 The current record begin marker used in BinStorage is the consecutive 
 sequence 0x21, 0x31, 0x41 - these bytes correspond to the ascii characters 
 !1A. This sequence is not very strong as a marker, and this results in 
 failures when the sequence occurs in the data. The markers should be control 
 characters, which have a high probability of not occurring in the data.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-649) RandomSampleLoader does not handle skipping correctly in getNext()

2009-01-30 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-649:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed

 RandomSampleLoader does not handle skipping correctly in getNext()
 --

 Key: PIG-649
 URL: https://issues.apache.org/jira/browse/PIG-649
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-649.patch


 Currently RandomSampleLoader calls skip() on the underlying input stream 
 (BufferedPositionedInputStream) in its getNext(). The input stream may not 
 actually skip over the amount the RandomSampleLoader needs in one call. 
 RandomSampleLoader should check the return value from the skip() call and 
 ensure that skip() is called repeatedly (if necessary) till the needed number 
 of bytes are skipped.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-428) TypeCastInserter does not replace projects in inner plans correctly

2009-02-02 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12669687#action_12669687
 ] 

Pradeep Kamath commented on PIG-428:


Have you tried your query with the top of trunk? The issue originally fixed 
here occurred when the TypeCastInserter was involved in the query. That is the 
case only when the load statement has a schema, like a = load 'bla' as (x:int, 
y:float);. In your query in the previous comment the load statement does not 
have a schema. I am wondering if the issue is somewhere else in that query but 
the error message is the same. 

 TypeCastInserter does not replace projects in inner plans correctly
 ---

 Key: PIG-428
 URL: https://issues.apache.org/jira/browse/PIG-428
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-428.patch


 The TypeCastInserter tries to replace the Project's input operator in inner 
 plans with the new foreach operator it adds. However it should replace only 
 those Projects' input where the new Foreach has been added after the operator 
 which was earlier the input to Project.
 Here is a query which fails due to this:
 {code}
 a = load 'st10k' as (name:chararray,age:int, gpa:double);
 another = load 'st10k';
 c = foreach another generate $0, $1+ 10, $2 + 10;
 d = join a by $0, c by $0;
 dump d;
 {code}
 Here is the error:
 {noformat}
 2008-09-11 23:34:28,169 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error 
 message from task (map) tip_200809051428_0045_m_00java.io.IOException: 
 Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, 
 recieved org.apache.pig.impl.io.NullableBytesWritable
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:419)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:83)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:172)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:158)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:75)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
 at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-653) Make fieldsToRead work in loader

2009-02-09 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-653:
---

Attachment: PIG-653-2.comment

A new proposal has been attached as a revision of the proposal in comment 1.

The two main changes are:
1. A new class RequiredFieldList will be used to convey the list of required 
fields. A separate class was chosen here (rather than using the 
List<RequiredField> and boolean separately) since it gives us the flexibility 
to extend it easily in the future.
2. The new type, BAG_OF_MAP, is no longer needed. If a certain field is a bag 
(named bg) which contains a single column which is a map, and we need the data 
for only one key (say k1) from it, we can represent that by having a 
RequiredField object of type BAG with alias bg. This object will have one 
RequiredField object in its subFields list, which will be of type MAP and 
which will have index 0 to indicate that it is the first subfield in the bag. 
This object in turn will have one RequiredField object in its subFields list, 
which will be of type BYTEARRAY and which will have alias k1. This illustrates 
how subcolumns of interest can be represented by the RequiredField class (see 
the sketch below).
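
A sketch of that nesting in Java (the RequiredField class and its setters come 
from the attached proposal, so the exact method names here are assumptions):
{code}
import java.util.Collections;
import org.apache.pig.data.DataType;

public class RequiredFieldSketch {
    // Builds the structure from point 2: bag bg -> map (index 0) -> key k1.
    static RequiredField bagWithMapKey() {
        RequiredField key = new RequiredField();
        key.setType(DataType.BYTEARRAY);
        key.setAlias("k1");                 // the single map key we need

        RequiredField map = new RequiredField();
        map.setType(DataType.MAP);
        map.setIndex(0);                    // first (only) subfield of the bag
        map.setSubFields(Collections.singletonList(key));

        RequiredField bag = new RequiredField();
        bag.setType(DataType.BAG);
        bag.setAlias("bg");
        bag.setSubFields(Collections.singletonList(map));
        return bag;
    }
}
{code}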


 Make fieldsToRead work in loader
 

 Key: PIG-653
 URL: https://issues.apache.org/jira/browse/PIG-653
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Attachments: PIG-653-2.comment


 Currently pig does not call the fieldsToRead function in LoadFunc, thus it 
 does not provide information to load functions on what fields are needed.  We 
 need to implement a visitor that determines (where possible) which fields in 
 a file will be used and relays that information to the load function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-665) Map key type not correctly set (for use when key is null) when map plan does not have localrearrange

2009-02-11 Thread Pradeep Kamath (JIRA)
Map key type not correctly set (for use when key is null) when map plan does 
not have localrearrange


 Key: PIG-665
 URL: https://issues.apache.org/jira/browse/PIG-665
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch


KeyTypeDiscoveryVisitor visits the map plan to figure out the datatype of the 
map key. This is required so that when the map key is null, we can still 
construct a valid NullableXXXWritable object to pass on to hadoop in the 
collect() call (hadoop needs a valid object even for null objects). Currently 
the KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange to 
figure out the key type. In a pig script which results in multiple map-reduce 
jobs, one of the jobs could have a map plan with only POLoads in it. In such a 
case, the map key type is not discovered, and this results in a null being 
returned from the HDataType.getWritableComparableTypes() method. This in turn 
results in a NullPointerException in the collect().
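
For illustration, here is a simplified sketch of why the discovered key type 
matters for null keys (the real logic lives in 
HDataType.getWritableComparableTypes(); the helper below is an assumption, 
though the Nullable writable classes are Pig's):
{code}
import org.apache.pig.data.DataType;
import org.apache.pig.impl.io.NullableIntWritable;
import org.apache.pig.impl.io.NullableText;
import org.apache.pig.impl.io.PigNullableWritable;

public class NullKeySketch {
    // Even for a null key, collect() needs a concrete writable instance;
    // only the discovered key *type* tells us which class to instantiate.
    static PigNullableWritable forNullKey(byte keyType) {
        PigNullableWritable w;
        switch (keyType) {
            case DataType.INTEGER:   w = new NullableIntWritable(); break;
            case DataType.CHARARRAY: w = new NullableText();        break;
            default:
                // an undiscovered key type is exactly the bug described
                // here: there is nothing sensible to return
                return null;
        }
        w.setNull(true);
        return w;
    }
}
{code}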

Here is a script which can prompt this behavior:
{code}
a = load 'a.txt' as (x:int, y:int, z:int);
b = load 'b.txt' as (x:int, y:int);
b_group = group b by x;
b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
a_group = group a by (x, y);
a_aggs = foreach a_group {
generate 
flatten(group) as (x, y),
SUM(a.z) as zs;
};
join_a_b = join b_sum by x, a_aggs by x; -- the map plan for this join will 
only have two POLoads which will result in the NullPointerException at runtime 
in collect()
dump join_a_b;

{code} 

Contents of a.txt (columns are tab separated):
The first column of the first two rows is null (represented by an empty column)
{noformat}
7   8
8   9
1   20  30
1   20  40
{noformat}

Contents of b.txt (columns are tab separated):
{noformat}
7   2
1   5
1   10
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-665) Map key type not correctly set (for use when key is null) when map plan does not have localrearrange

2009-02-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-665:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed.

 Map key type not correctly set (for use when key is null) when map plan does 
 not have localrearrange
 

 Key: PIG-665
 URL: https://issues.apache.org/jira/browse/PIG-665
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-665.patch


 KeyTypeDiscoveryVisitor visits the map plan to figure out the datatype of the 
 map key. This is required so that when the map key is null, we can still 
 construct a valid NullableXXXWritable object to pass on to hadoop in the 
 collect() call (hadoop needs a valid object even for null objects). Currently 
 the KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange to 
 figure out the key type. In a pig script which results in multiple Map reduce 
 jobs, one of the jobs could have a map plan with only POLoads in it. In such 
 a case, the map key type is not discovered and this results in a null being 
 returned from HDataType.getWritableComparableTypes() method. This in turn 
 will result in a NullPointerException in the collect().
 Here is a script which can prompt this behavior:
 {code}
 a = load 'a.txt' as (x:int, y:int, z:int);
 b = load 'b.txt' as (x:int, y:int);
 b_group = group b by x;
 b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
 a_group = group a by (x, y);
 a_aggs = foreach a_group {
     generate
         flatten(group) as (x, y),
         SUM(a.z) as zs;
 };
 -- the map plan for this join will only have two POLoads, which will result
 -- in the NullPointerException at runtime in collect()
 join_a_b = join b_sum by x, a_aggs by x;
 dump join_a_b;
 {code} 
 Contents of a.txt (columns are tab separated):
 The first column of the first two rows is null (represented by an empty 
 column)
 {noformat}
 7   8
 8   9
 1   20  30
 1   20  40
 {noformat}
 Contents of b.txt (columns are tab separated):
 {noformat}
 7   2
 1   5
 1   10
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution

2009-02-18 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-545:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed.

 PERFORMANCE: Sampler for order bys does not produce a good distribution
 ---

 Key: PIG-545
 URL: https://issues.apache.org/jira/browse/PIG-545
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-545-v3.patch, PIG-545-v4.patch, WRP.patch, WRP1.patch


 In running tests on actual data, I've noticed that the final reduce of an 
 order by has skewed partitions.  Some reduces finish in a few seconds while 
 some run for 20 minutes.  Getting a better distribution should lead to much 
 better performance for order by.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-652) Need to give user control of OutputFormat

2009-02-19 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-652:
---

Fix Version/s: types_branch
 Assignee: Pradeep Kamath  (was: Alan Gates)
Affects Version/s: types_branch
 Hadoop Flags: [Incompatible change]
   Status: Patch Available  (was: Open)

Submitting a patch with a few changes to the way this will work. Very soon we 
will have the ability to store multiple outputs in the map or reduce phase of 
a job (https://issues.apache.org/jira/browse/PIG-627). In that scenario the 
OutputFormat will still need to be able to get a handle on the corresponding 
StoreFunc, location and schema to use for the particular output that it is 
trying to write. To enable this, a utility class - MapRedUtil - is being 
introduced, with static methods which take a JobConf and return these pieces 
of information. When PIG-627 is implemented, these utility methods will hide 
the inner Pig implementation that maps the multiple stores to the 
corresponding StoreFunc, location and schema.

The new method in StoreFunc proposed at the beginning of this issue will still 
be used to ask the StoreFunc if it will provide an OutputFormat implementation.
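
A sketch of the utility-class shape being described (the JobConf property 
names and method signatures below are assumptions for illustration; the real 
ones are in the attached patch):
{code}
import org.apache.hadoop.mapred.JobConf;

public class MapRedUtilSketch {
    // Keys under which Pig would stash store information in the JobConf;
    // the names are assumptions for illustration.
    private static final String STORE_FUNC_KEY = "pig.storeFunc";
    private static final String OUT_LOCATION_KEY = "pig.out.location";

    // An OutputFormat asks for the store function spec of the output it is
    // writing; with PIG-627 (multiple stores) helpers like these would hide
    // how Pig maps each output back to its StoreFunc, location and schema.
    public static String getStoreFuncSpec(JobConf conf) {
        return conf.get(STORE_FUNC_KEY);
    }

    public static String getOutputLocation(JobConf conf) {
        return conf.get(OUT_LOCATION_KEY);
    }
}
{code}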

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-652.patch


 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-652) Need to give user control of OutputFormat

2009-02-19 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-652:
---

Attachment: PIG-652.patch

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-652.patch


 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-652) Need to give user control of OutputFormat

2009-02-19 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-652:
---

Attachment: PIG-652-v2.patch

Attached new version which addresses the comment regarding having a 
serialVersionUID in StoreConfig since it is Serializable. Also removed a 
redundant import from StoreFunc.java.

Moving the string constants to a separate PropertyKeys.java file would be 
good, but to be useful all existing free-standing constants would need to be 
moved there - this would be good in a separate jira, as suggested.

Having schemas in all PhysicalOperators might be good but would need all the 
visitors (optimizers etc.) which run after LogToPhyTranslationVisitor and 
which introduce new PhysicalOperators to also generate schemas for them. 
Likewise these visitors would have to handle schema changes on operators they 
modify - this might be better pursued in a different jira if found to be 
worthwhile.

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-652-v2.patch, PIG-652.patch


 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-652) Need to give user control of OutputFormat

2009-02-23 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-652:
---

Attachment: PIG-652-v3.patch

Attached a new version of the patch. Changes include:
1) Included the MapRedUtil.java source file which was missing in the previous 
patch
2) Fixed a few issues which were uncovered in some tests
3) Regenerated the patch with the latest svn revision

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652.patch


 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-591) Error handling phase four

2009-02-24 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676399#action_12676399
 ] 

Pradeep Kamath commented on PIG-591:


Code review comments:

The patch looks good to go, with minor observations below:

- The System.err.println() message in PigHadoopLogger.warn() seems like a 
debug statement.
- In EvalFunc.progress() there is:
  log.warn("No reporter object provided to UDF " + this.getClass().getName());
  Shouldn't this go through the PigLogger?
- If we want warning aggregation in UDFs, should the UDF writer create new 
entries in PigWarning? (If so, the UDF manual should probably outline this.)
- Is there a reason why initialized needs to be volatile in PigMapBase? There 
should be only one map thread in the map() function. If there is a reason for 
it to be volatile, does it apply to PigMapReduce, PigCombiner and POUserFunc 
as well?
- In POUserFunc.instantiateFunc(), should we still set the Reporter and 
PigLogger if the assignments don't actually work and we rely on processInput() 
for these initializations?
- In DefaultAbstractBag, warn() should mimic Utf8StorageConverter.
- GruntParser.java has only a whitespace change (the change should be reverted 
since earlier there were spaces and now there is a tab).




 Error handling phase four
 -

 Key: PIG-591
 URL: https://issues.apache.org/jira/browse/PIG-591
 Project: Pig
  Issue Type: Sub-task
  Components: grunt, impl, tools
Affects Versions: types_branch
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: types_branch

 Attachments: Error_handling_phase4.patch


 Phase four of the error handling feature will address the warning message 
 cleanup and warning message aggregation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-591) Error handling phase four

2009-02-25 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-591.


  Resolution: Fixed
Hadoop Flags: [Reviewed]

Santhosh, thanks for the feature contribution. Patch committed with the 
following changes in HExecutionEngine.java:
{code}
@@ -200,7 +200,7 @@
         }
         catch (IOException e) {
             int errCode = 6009;
-            String msg = "Failed to create job client";
+            String msg = "Failed to create job client:" + e.getMessage();
             throw new ExecException(msg, errCode, PigException.BUG, e);
         }
     }
@@ -549,11 +549,20 @@
             //this should return as soon as connection is shutdown
             int rc = p.waitFor();
             if (rc != 0) {
-                String errMsg = new String();
+                StringBuilder errMsg = new StringBuilder();
                 try {
-                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
-                    errMsg = br.readLine();
+                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
+                    String line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
                     br.close();
+                    br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
+                    line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
+                    br.close();
                 } catch (IOException ioe) {}
                 int errCode = 6011;
                 StringBuilder msg = new StringBuilder("Failed to run command ");
@@ -563,7 +572,7 @@
                 msg.append("; return code: ");
                 msg.append(rc);
                 msg.append("; error: ");
-                msg.append(errMsg);
+                msg.append(errMsg.toString());
                 throw new ExecException(msg.toString(), errCode, PigException.REMOTE_ENVIRONMENT);
             }
         } catch (Exception e){
{code}

These extra changes are needed so that the right error message is shown when 
there is an error while connecting to DFS. Since this is the last error 
handling related patch, it seemed logical to add this here. The above change 
has been taken from the patch submitted for 
http://issues.apache.org/jira/browse/PIG-682, so when that patch is finally 
committed this portion can be omitted.

 Error handling phase four
 -

 Key: PIG-591
 URL: https://issues.apache.org/jira/browse/PIG-591
 Project: Pig
  Issue Type: Sub-task
  Components: grunt, impl, tools
Affects Versions: types_branch
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: types_branch

 Attachments: Error_handling_phase4.patch, 
 Error_handling_phase4_1.patch


 Phase four of the error handling feature will address the warning message 
 cleanup and warning message aggregation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-682) Fix the ssh tunneling code

2009-02-25 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676810#action_12676810
 ] 

Pradeep Kamath commented on PIG-682:


As noted in 
https://issues.apache.org/jira/browse/PIG-591?focusedCommentId=12676808#action_12676808
 a part of this patch has already been committed as part of 
https://issues.apache.org/jira/browse/PIG-591. The portion which has already 
been committed is in HExecutionEngine.java:
{code}
@@ -200,7 +200,7 @@
         }
         catch (IOException e) {
             int errCode = 6009;
-            String msg = "Failed to create job client";
+            String msg = "Failed to create job client:" + e.getMessage();
             throw new ExecException(msg, errCode, PigException.BUG, e);
         }
     }
@@ -549,11 +549,20 @@
             //this should return as soon as connection is shutdown
             int rc = p.waitFor();
             if (rc != 0) {
-                String errMsg = new String();
+                StringBuilder errMsg = new StringBuilder();
                 try {
-                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
-                    errMsg = br.readLine();
+                    BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
+                    String line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
                     br.close();
+                    br = new BufferedReader(new InputStreamReader(p.getErrorStream()));
+                    line = null;
+                    while((line = br.readLine()) != null) {
+                        errMsg.append(line);
+                    }
+                    br.close();
                 } catch (IOException ioe) {}
                 int errCode = 6011;
                 StringBuilder msg = new StringBuilder("Failed to run command ");
@@ -563,7 +572,7 @@
                 msg.append("; return code: ");
                 msg.append(rc);
                 msg.append("; error: ");
-                msg.append(errMsg);
+                msg.append(errMsg.toString());
                 throw new ExecException(msg.toString(), errCode, PigException.REMOTE_ENVIRONMENT);
             }
         } catch (Exception e){
{code}

When a new revision of this patch is generated to make the changes for the 
previous review comment, the above portion of code changes can be omitted.

 Fix the ssh tunneling code
 --

 Key: PIG-682
 URL: https://issues.apache.org/jira/browse/PIG-682
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Benjamin Reed
 Attachments: jsch-0.1.41.jar, PIG-682.patch


 Hadoop has changed a bit and the ssh-gateway code no longer works. Pig needs 
 to be updated to register with the new socket framework. Reporting of 
 problems also needs to be better.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-652) Need to give user control of OutputFormat

2009-02-27 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-652:
---

Attachment: PIG-652-v4.patch

Attaching new patch - the only difference is:
(old code is the first input in the diff below)
{code}
< --- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java  (revision 747112)
---
> --- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java  (revision 748740)
146c146
< +        if(sPrepClass != null && sPrepClass.isInstance(OutputFormat.class)) {
---
> +        if(sPrepClass != null && OutputFormat.class.isAssignableFrom(sPrepClass)) {
{code}

The code checks whether the class supplied by the Loader is of type 
OutputFormat. A short illustration of the difference between isInstance() and 
isAssignableFrom() follows.
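
For reference, a minimal standalone illustration of why isAssignableFrom() is 
the right call here (a sketch, assuming the old-API org.apache.hadoop.mapred 
classes; TextOutputFormat merely stands in for whatever class might be supplied):
{code}
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class InstanceCheckDemo {
    public static void main(String[] args) {
        Class<?> sPrepClass = TextOutputFormat.class;

        // isInstance() tests whether a given *object* is an instance of
        // sPrepClass. The object passed here is the Class object
        // OutputFormat.class, which is an instance of java.lang.Class,
        // not of TextOutputFormat - so this is always false:
        System.out.println(sPrepClass.isInstance(OutputFormat.class)); // false

        // isAssignableFrom() tests the subtype relation between the two
        // *types*, which is the check the patch actually needs:
        System.out.println(OutputFormat.class.isAssignableFrom(sPrepClass)); // true
    }
}
{code}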

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652-v4.patch, 
 PIG-652.patch


 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (PIG-652) Need to give user control of OutputFormat

2009-02-27 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677603#action_12677603
 ] 

pkamath edited comment on PIG-652 at 2/27/09 4:06 PM:
-

Attaching new patch - the only difference is:
(old code is the first input in the diff below)
{code}
< --- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java  (revision 747112)
---
> --- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java  (revision 748740)
146c146
< +        if(sPrepClass != null && sPrepClass.isInstance(OutputFormat.class)) {
---
> +        if(sPrepClass != null && OutputFormat.class.isAssignableFrom(sPrepClass)) {
{code}

The code checks whether the class supplied by the StoreFunc is of type 
OutputFormat

  was (Author: pkamath):
Attaching new patch - the only difference is:
(old code is the first input in the diff below)
{code}
< --- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java  (revision 747112)
---
> --- src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java  (revision 748740)
146c146
< +        if(sPrepClass != null && sPrepClass.isInstance(OutputFormat.class)) {
---
> +        if(sPrepClass != null && OutputFormat.class.isAssignableFrom(sPrepClass)) {
{code}

The code checks whether the class supplied by the Loader is of type 
OutputFormat
  
 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652-v4.patch, 
 PIG-652.patch


 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-691) BinStorage skips tuples when ^A is present in data

2009-03-02 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-691:
---

Fix Version/s: types_branch
Affects Version/s: types_branch
   Status: Patch Available  (was: Open)

BinStorage uses the RECORD_1, RECORD_2 and RECORD_3 byte markers (the bytes 0x01, 
0x02, 0x03) to mark the beginning of a new record. The current bug in BinStorage 
is that in getNext(), the code looks for RECORD_1 and, if it finds it, looks for 
RECORD_2. If it fails to find RECORD_2, it goes back to looking for the entire 
sequence, starting with RECORD_1. However, this fails on the following sequence: 
RECORD_1-RECORD_1-RECORD_2-RECORD_3. After reading the second RECORD_1 in the 
above sequence, we should not look for RECORD_1 again but start by looking for 
RECORD_2. This is an issue only when a record in BinStorage spans two blocks and 
the part at the head of the second block has the above sequence. This can happen 
when the last field in the record is null (null is represented by the byte 0x01, 
which is the same as RECORD_1). The attached patch fixes this issue; a minimal 
sketch of the corrected scan follows.
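
For illustration, a self-contained sketch of the corrected scan (hypothetical 
method and stream handling; the real getNext() also deals with reading past the 
split boundary and deserializing the record that follows the markers):
{code}
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class MarkerScanSketch {
    private static final int RECORD_1 = 0x01;
    private static final int RECORD_2 = 0x02;
    private static final int RECORD_3 = 0x03;

    // Advance the stream just past the next RECORD_1 RECORD_2 RECORD_3
    // sequence; return false on EOF. The key point of the fix: after a
    // mismatch, the byte just read is re-examined at the top of the loop
    // instead of blindly restarting the search with a fresh read, so the
    // sequence RECORD_1 RECORD_1 RECORD_2 RECORD_3 is handled correctly.
    static boolean seekToRecordStart(DataInputStream in) throws IOException {
        int b = in.read();
        while (b != -1) {
            if (b != RECORD_1) { b = in.read(); continue; }
            b = in.read();               // expect RECORD_2
            if (b != RECORD_2) continue; // re-check b: it may be RECORD_1
            b = in.read();               // expect RECORD_3
            if (b != RECORD_3) continue; // likewise re-check b
            return true;                 // full marker sequence found
        }
        return false;                    // EOF before a record start
    }

    public static void main(String[] args) throws IOException {
        // Exactly the sequence described above: a trailing null (0x01)
        // followed by the real record markers.
        byte[] data = { 0x01, 0x01, 0x02, 0x03, 0x2a };
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        System.out.println(seekToRecordStart(in)); // true with the fixed scan
    }
}
{code}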

 BinStorage skips tuples when ^A is present in data
 --

 Key: PIG-691
 URL: https://issues.apache.org/jira/browse/PIG-691
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: types_branch


 Pradeep found a problem with BinStorage.getNext function that causes data 
 loss. He is working on the fix

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-691) BinStorage skips tuples when ^A is present in data

2009-03-02 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-691:
---

Attachment: PIG-691.patch

 BinStorage skips tuples when ^A is present in data
 --

 Key: PIG-691
 URL: https://issues.apache.org/jira/browse/PIG-691
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-691.patch


 Pradeep found a problem with BinStorage.getNext function that causes data 
 loss. He is working on the fix

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-690) UNION doesn't work in the latest code

2009-03-02 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-690:
---

Attachment: PIG-690.patch

 UNION doesn't work in the latest code
 -

 Key: PIG-690
 URL: https://issues.apache.org/jira/browse/PIG-690
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
 Environment: mapred mode; local mode has the same problem under linux.
 code is taken from trunk
Reporter: Amir Youssefi
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-690.patch


 grunt> a = load 'tmp/f1' using BinStorage();
 grunt> b = load 'tmp/f2' using BinStorage();
 grunt> describe a;
 a: {int,chararray,int,{(int,chararray,chararray)}}
 grunt> describe b;
 b: {int,chararray,int,{(int,chararray,chararray)}}
 grunt> c = union a,b;
 grunt> describe c;
 2009-02-27 11:51:46,012 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1052: Cannot cast bag with schema bag({(int,chararray,chararray)}) to tuple 
 with schema tuple
 Details at logfile: /homes/amiry/pig_1235735380348.log
 dump a and dump b work fine.
 Sample data provided to dev team in an e-mail.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-655) Comparison of schemas of bincond operands is flawed

2009-03-02 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678210#action_12678210
 ] 

Pradeep Kamath commented on PIG-655:


I will be reviewing this patch

 Comparison of schemas of bincond operands is flawed
 ---

 Key: PIG-655
 URL: https://issues.apache.org/jira/browse/PIG-655
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: types_branch

 Attachments: PIG-655.patch


 The comparison of schemas of bincond is flawed. Instead of comparing the 
 field schemas, the type checker is comparing the schemas.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-692) when running script file, automatically set up job name based on the file name

2009-03-03 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678458#action_12678458
 ] 

Pradeep Kamath commented on PIG-692:


+1 for the change

 when running script file, automatically set up job name based on the file name
 --

 Key: PIG-692
 URL: https://issues.apache.org/jira/browse/PIG-692
 Project: Pig
  Issue Type: Improvement
  Components: tools
Affects Versions: types_branch
Reporter: Vadim Zaliva
Priority: Trivial
 Fix For: types_branch

 Attachments: pig-job-name.patch


 When running a pig script from the command line like this:
 pig scriptfile
 right now the default job name is used. It is convenient to have it 
 automatically set based on the script name.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-577) outer join query looses name information

2009-03-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-577.


Resolution: Fixed

 outer join query looses name information
 

 Key: PIG-577
 URL: https://issues.apache.org/jira/browse/PIG-577
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: types_branch

 Attachments: PIG-577.patch


 The following query:
 A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
 B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararray, 
 contributions: float);
 C = COGROUP A BY name, B BY name;
 D = FOREACH C GENERATE group, flatten((IsEmpty(A) ? null : A)), 
 flatten((IsEmpty(B) ? null : B));
 describe D;
 E = FOREACH D GENERATE A::gpa, B::contributions;
 Gives the following error (even though describe shows the correct information: D: 
 {group: chararray,A::name: chararray,A::age: int,A::gpa: float,B::name: 
 chararray,B::age: int,B::registration: chararray,B::contributions: float}):
 java.io.IOException: Invalid alias: A::gpa in {group: 
 chararray,bytearray,bytearray}
 at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
 at org.apache.pig.Main.main(Main.java:306)
 Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Invalid 
 alias: A::gpa in {group: chararray,bytearray,bytearray}
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.AliasFieldOrSpec(QueryParser.java:5930)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.ColOrSpec(QueryParser.java:5788)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseEvalSpec(QueryParser.java:3974)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.UnaryExpr(QueryParser.java:3871)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.CastExpr(QueryParser.java:3825)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.MultiplicativeExpr(QueryParser.java:3734)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.AdditiveExpr(QueryParser.java:3660)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.InfixExpr(QueryParser.java:3626)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItem(QueryParser.java:3552)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItemList(QueryParser.java:3462)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.GenerateStatement(QueryParser.java:3419)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.NestedBlock(QueryParser.java:2894)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.ForEachClause(QueryParser.java:2309)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:966)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:742)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:537)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:60)
 at org.apache.pig.PigServer.parseQuery(PigServer.java:295)
 ... 6 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-577) outer join query looses name information

2009-03-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-577:
---

Hadoop Flags: [Reviewed]

+1, Patch committed - thanks for the fix Santhosh.

 outer join query looses name information
 

 Key: PIG-577
 URL: https://issues.apache.org/jira/browse/PIG-577
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Assignee: Santhosh Srinivasan
 Fix For: types_branch

 Attachments: PIG-577.patch


 The following query:
 A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
 B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararray, 
 contributions: float);
 C = COGROUP A BY name, B BY name;
 D = FOREACH C GENERATE group, flatten((IsEmpty(A) ? null : A)), 
 flatten((IsEmpty(B) ? null : B));
 describe D;
 E = FOREACH D GENERATE A::gpa, B::contributions;
 Gives the following error (even though describe shows the correct information: D: 
 {group: chararray,A::name: chararray,A::age: int,A::gpa: float,B::name: 
 chararray,B::age: int,B::registration: chararray,B::contributions: float}):
 java.io.IOException: Invalid alias: A::gpa in {group: 
 chararray,bytearray,bytearray}
 at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
 at org.apache.pig.Main.main(Main.java:306)
 Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Invalid 
 alias: A::gpa in {group: chararray,bytearray,bytearray}
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.AliasFieldOrSpec(QueryParser.java:5930)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.ColOrSpec(QueryParser.java:5788)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseEvalSpec(QueryParser.java:3974)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.UnaryExpr(QueryParser.java:3871)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.CastExpr(QueryParser.java:3825)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.MultiplicativeExpr(QueryParser.java:3734)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.AdditiveExpr(QueryParser.java:3660)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.InfixExpr(QueryParser.java:3626)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItem(QueryParser.java:3552)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.FlattenedGenerateItemList(QueryParser.java:3462)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.GenerateStatement(QueryParser.java:3419)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.NestedBlock(QueryParser.java:2894)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.ForEachClause(QueryParser.java:2309)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:966)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:742)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:537)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:60)
 at org.apache.pig.PigServer.parseQuery(PigServer.java:295)
 ... 6 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-04 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12679021#action_12679021
 ] 

Pradeep Kamath commented on PIG-627:


I committed multi-store-0304.patch into the multi-query branch after 
reviewing the changes.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
 Fix For: types_branch

 Attachments: multi-store-0303.patch, multi-store-0304.patch, 
 multiquery_0223.patch, multiquery_0224.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-10 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680724#action_12680724
 ] 

Pradeep Kamath commented on PIG-627:


multiquery_0306.patch seems to have a lot of code from the earlier patch ( 
multi-store-0304.patch). Richard, can you svn up your code base and regenerate 
the patch with only the changes you intended?

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680997#action_12680997
 ] 

Pradeep Kamath commented on PIG-627:


Sorry about the misunderstanding, I think I looked at a different patch. After 
reviewing the right patch, here are some comments:

The patch throws Java Exceptions like IllegalStateException. This should be 
replaced with the appropriate Exception class (like MRCompilerException) as 
specified in 
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification. The 
exception should be created with the error code, error source and error message 
constructor. New error codes should be introduced if one of the existing ones 
in 
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification#head-9f71d78d362c3307711f98ec9db3ee12b55e92f6
 cannot be used. If new codes are introduced, the wiki table should be updated.

The following can be used to check for file existence in 
BinStorage.determineSchema() - only in the case where the file does not exist 
should null be returned (a sketch of the suggested guard follows the code):
{code}
public static boolean fileExists(String filename, DataStorage store)
        throws IOException {
    ElementDescriptor elem = store.asElement(filename);
    return elem.exists() || globMatchesFiles(elem, store);
}
{code}
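
To illustrate the suggested use, a sketch of the guard in determineSchema() 
(signature as in the 0.2-era LoadFunc; the schema-reading body is elided and 
the class name is only indicative):
{code}
import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.backend.datastorage.DataStorage;
import org.apache.pig.impl.io.FileLocalizer;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class DetermineSchemaSketch {
    public Schema determineSchema(String fileName, ExecType execType,
            DataStorage storage) throws IOException {
        // Return null only when the file genuinely does not exist
        // (e.g., an intermediate file at compile time in batch mode);
        // any other IOException should propagate to the caller.
        if (!FileLocalizer.fileExists(fileName, storage)) {
            return null;
        }
        // ... open fileName and read the stored schema here ...
        return null; // placeholder for the real schema-reading logic
    }
}
{code}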

Instead of introducing a rootsFirst attribute in DependencyOrderWalker, I 
wonder if we should have a ReverseDependencyOrderWalker since that is what the 
rootsFirst == false case will be. If we are not visiting roots to leaf, we 
really are not visiting in a dependency order - so the meaning of dependency 
order is no longer honored - this can be confusing I think. By explicitly 
naming the walker ReverseDependencyOrderWalker, the intent of walking from 
leaves to roots is more clear I think.

In POSplit currently there is a PhysicalPlan representing the merged inner 
plans (where all plans are mutually exclusive) and there is also a 
List<PhysicalPlan> which has the same information in the form of a List. In the 
rest of the Pig code, inner plans have always been modelled as List<PhysicalPlan>. 
For consistency, it is better to just have a List<PhysicalPlan> to represent 
the inner plans.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681085#action_12681085
 ] 

Pradeep Kamath commented on PIG-627:


Committed patch per previous comment that the review comments will be addressed 
in the next patch - thanks Richard for the contribution. 

In general from Pig code we always want to throw known PigExceptions even for 
programming errors or internal state errors - in these cases, we just use 
PigException.BUG as the source of the exception. RuntimeException should be 
used when we want to throw an exception from a function which cannot throw any 
checked exceptions (like methods from the Hadoop API which we are implementing 
that do not declare any Exception). A short sketch of both conventions follows.
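
As a concrete (if simplified) sketch of these two conventions - the error code 
below is made up for illustration; real codes come from the error-handling 
specification table on the wiki:
{code}
import java.io.IOException;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;

public class ExceptionConventionSketch {
    // Inside regular Pig code: wrap internal errors in a PigException
    // subclass, created with message, error code and error source.
    static void failAsBug() throws ExecException {
        try {
            throw new IOException("simulated internal failure");
        } catch (IOException e) {
            int errCode = 2999; // hypothetical internal-error code
            String msg = "Internal error while compiling plan";
            throw new ExecException(msg, errCode, PigException.BUG, e);
        }
    }

    // Inside a method whose signature allows no checked exceptions
    // (e.g., an implemented Hadoop API callback): rethrow unchecked.
    public void noThrowsCallback() {
        try {
            failAsBug();
        } catch (ExecException e) {
            throw new RuntimeException(e);
        }
    }
}
{code}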

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688339#action_12688339
 ] 

Pradeep Kamath commented on PIG-627:


Comments for Richard's patch - multiquery-phase2_0313.patch

In MultiQueryOptimizer:
- What about the case where mr is not map-only and has an mr splittee - is this 
not handled for now?
- Are the single mapper case and the single map-reduce case the ones where the 
script has an explicit store 'file' and load 'file'? If so, then in 
mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(), the store is 
removed - shouldn't the store remain?
- There is common code in mergeOnlyMapperSplittee() and 
mergeOnlyMapReduceSplittee() which should be moved to a function to reduce the 
code duplication.

Just want to confirm that the multi query optimization is only for map reduce 
mode - since the optimizer is being called in MapReduceLauncher

In POForEach when there is POStatus.STATUS_ERR, it is returned to the caller. I 
noticed that in POSplit, it causes an exception - I think it should return the 
error which would later be caught in the map() or reduce() - a test to make 
sure errors do get caught and cause failures would be good.

spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of 
ReverseDependencyOrderWalker.

The following comment in BinStorage needs to be clarified - does the last line 
mean "the same way as we would if we did not get a valid record"?
{noformat}
if (!FileLocalizer.fileExists(fileName, storage)) {
    // At compile time in batch mode, the file may not exist
    // (such as intermediate file). Just return null - the
    // same way as we couldn't get a valid record from the input.
    return null;
}
{noformat}


 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
 multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688356#action_12688356
 ] 

Pradeep Kamath commented on PIG-627:


+1 on Gunther's patch - multiquery_explain_fix.patch. Patch has been committed 
to the multiquery branch - thanks for the fix Gunther!

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
 multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688461#action_12688461
 ] 

Pradeep Kamath commented on PIG-627:


+1 on Richard's patch -  multiquery-phase2_0323.patch, patch committed to 
multiquery branch - thanks for the contribution Richard.

A general comment for the multiquery work is to introduce, at some point, some 
negative test cases (which return POStatus.STATUS_ERR from some operator in the 
map or reduce plan affected by the MultiQueryOptimizer).

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-729) Use of default parallelism

2009-03-24 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688755#action_12688755
 ] 

Pradeep Kamath commented on PIG-729:


Another option may be to detect map-reduce boundaries in the script which do 
not have a parallel specification and prompt the user for a parallel value to 
use for all such boundaries (default being 1). This way users are given an 
opportunity at submit time to specify parallelism if they forgot to do so in 
the script.

 Use of default parallelism
 --

 Key: PIG-729
 URL: https://issues.apache.org/jira/browse/PIG-729
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 1.0.1
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
 Fix For: 1.0.1


 Currently, if the user does not specify the number of reduce slots using the 
 parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
 This model worked well with dynamically allocated clusters using HOD and for 
 static clusters where the default number of reduce slots was explicitly set. 
 With Hadoop 0.20, a single static cluster will be shared amongst a number of 
 queues. As a result, a common scenario is to end up with default number of 
 reducers set to one (1).
 When users migrate to Hadoop 0.20, they might see a dramatic change in the 
 performance of their queries if they had not used the parallel keyword to 
 specify the number of reducers. In order to mitigate such circumstances, Pig 
 can support one of the following:
 1. Specify a default parallelism for the entire script.
 This option will allow users to use the same parallelism for all operators 
 that do not have the explicit parallel keyword. This will ensure that the 
 scripts utilize more reducers than the default of one reducer. On the down 
 side, due to data transformations, usually operations that are performed 
 towards the end of the script will need smaller number of reducers compared 
 to the operators that appear at the beginning of the script.
 2. Display a warning message for each reduce side operator that does not have 
 the explicit use of the parallel keyword. Proceed with the execution.
 3. Display an error message indicating the operator that does not have the 
 explicit use of the parallel keyword. Stop the execution.
 Other suggestions/thoughts/solutions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-24 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688957#action_12688957
 ] 

Pradeep Kamath commented on PIG-627:


+1 - committed patch by Gunther to merge changes in trunk to multiquery branch 
- thanks for the contribution Gunther.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, 
 multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-03-25 Thread Pradeep Kamath (JIRA)
Order by sampling dumps entire sample to hdfs which causes dfs FileSystem 
closed error on large input
---

 Key: PIG-733
 URL: https://issues.apache.org/jira/browse/PIG-733
 Project: Pig
  Issue Type: Bug
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath


Order by has a sampling job which samples the input and creates a sorted list 
of sample items. Currently the number of items sampled is 100 per map task. So 
if the input is large, resulting in many maps (say 50,000), the sample is big. 
This sorted sample is stored on dfs. The WeightedRangePartitioner computes 
quantile boundaries and weighted probabilities for repeating values in each map 
by reading the samples file from DFS. In queries with many maps (in the order 
of 50,000) the dfs read of the sample file fails with a "FileSystem closed" 
error. This seems to point to a dfs issue wherein a big dfs file being read 
simultaneously by many dfs clients (in this case all maps) causes the clients 
to be closed. However on the pig side, loading the sample from each map in the 
final map reduce job and computing the quantile boundaries and weighted 
probabilities is inefficient. We should do this computation through a 
FindQuantiles udf in the same map reduce job which produces the sorted samples. 
This way less data is written to dfs and in the final map reduce job, the 
WeightedRangePartitioner needs to just load the computed information. A minimal 
sketch of the boundary computation follows.
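
As a rough illustration of the boundary computation the FindQuantiles udf 
performs (a self-contained sketch; the real udf works on tuples, runs in the 
reduce of the sampling job, and additionally emits weighted probabilities for 
values that straddle boundaries):
{code}
import java.util.ArrayList;
import java.util.List;

public class QuantileBoundarySketch {
    // Pick numQuantiles-1 boundary values splitting the sorted sample into
    // roughly equal ranges; partition i of the order-by job then receives
    // the keys falling into range i.
    static <T> List<T> findBoundaries(List<T> sortedSample, int numQuantiles) {
        List<T> boundaries = new ArrayList<T>();
        int n = sortedSample.size();
        for (int i = 1; i < numQuantiles; i++) {
            boundaries.add(sortedSample.get((int) ((long) i * n / numQuantiles)));
        }
        return boundaries;
    }

    public static void main(String[] args) {
        List<Integer> sample = new ArrayList<Integer>();
        for (int i = 0; i < 100; i++) sample.add(i);
        System.out.println(findBoundaries(sample, 4)); // [25, 50, 75]
    }
}
{code}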

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-01 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-733:
---

Fix Version/s: 0.3.0
Affects Version/s: 0.2.0
   Status: Patch Available  (was: Open)

Attached a patch which implements the fix described in the issue description.

 Order by sampling dumps entire sample to hdfs which causes dfs FileSystem 
 closed error on large input
 ---

 Key: PIG-733
 URL: https://issues.apache.org/jira/browse/PIG-733
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0


 Order by has a sampling job which samples the input and creates a sorted list 
 of sample items. Currently the number of items sampled is 100 per map task. 
 So if the input is large, resulting in many maps (say 50,000), the sample is 
 big. This sorted sample is stored on dfs. The WeightedRangePartitioner 
 computes quantile boundaries and weighted probabilities for repeating values 
 in each map by reading the samples file from DFS. In queries with many maps 
 (in the order of 50,000) the dfs read of the sample file fails with a 
 "FileSystem closed" error. This seems to point to a dfs issue wherein a big 
 dfs file being read simultaneously by many dfs clients (in this case all 
 maps) causes the clients to be closed. However on the pig side, loading the 
 sample from each map in the final map reduce job and computing the quantile 
 boundaries and weighted probabilities is inefficient. We should do this 
 computation through a FindQuantiles udf in the same map reduce job which 
 produces the sorted samples. This way less data is written to dfs and in the 
 final map reduce job, the WeightedRangePartitioner needs to just load the 
 computed information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-01 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-733:
---

Attachment: PIG-733.patch

 Order by sampling dumps entire sample to hdfs which causes dfs FileSystem 
 closed error on large input
 ---

 Key: PIG-733
 URL: https://issues.apache.org/jira/browse/PIG-733
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-733.patch


 Order by has a sampling job which samples the input and creates a sorted list 
 of sample items. Currently the number of items sampled is 100 per map task. 
 So if the input is large, resulting in many maps (say 50,000), the sample is 
 big. This sorted sample is stored on dfs. The WeightedRangePartitioner 
 computes quantile boundaries and weighted probabilities for repeating values 
 in each map by reading the samples file from DFS. In queries with many maps 
 (in the order of 50,000) the dfs read of the sample file fails with a 
 "FileSystem closed" error. This seems to point to a dfs issue wherein a big 
 dfs file being read simultaneously by many dfs clients (in this case all 
 maps) causes the clients to be closed. However on the pig side, loading the 
 sample from each map in the final map reduce job and computing the quantile 
 boundaries and weighted probabilities is inefficient. We should do this 
 computation through a FindQuantiles udf in the same map reduce job which 
 produces the sorted samples. This way less data is written to dfs and in the 
 final map reduce job, the WeightedRangePartitioner needs to just load the 
 computed information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-01 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694859#action_12694859
 ] 

Pradeep Kamath commented on PIG-627:


+1, patch committed - thanks for the contribution Gunther.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-06 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696244#action_12696244
 ] 

Pradeep Kamath commented on PIG-733:


Tests are not included in this patch since there are existing tests for order 
by.

All core unit tests passed, and findbugs gave the same number of warnings with 
and without the patch (output below). The excess warnings produced by the patch 
have been addressed in the new version of the patch (PIG-733-v2.patch).

{noformat}
=== CORE UNIT TESTS OUTPUT WITH PATCH
[prade...@afterside:/tmp/PIG-733/trunk]


test-core:
[mkdir] Created dir: /tmp/PIG-733/trunk/build/test/logs
[junit] Running org.apache.pig.test.TestAdd
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.056 sec
...
[junit] Running org.apache.pig.test.TestTypeCheckingValidatorNoSchema
[junit] Tests run: 13, Failures: 0, Errors: 0, Time elapsed: 0.629 sec
[junit] Running org.apache.pig.test.TestUnion
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 49.94 sec

test-contrib:

BUILD SUCCESSFUL
Total time: 77 minutes 47 seconds

=== FINDBUGS OUTPUT WITH PATCH
[prade...@afterside:/tmp/PIG-733/trunk]

[prade...@chargesize:/tmp/PIG-733/trunk]ant 
-Dfindbugs.home=/homes/pradeepk/findbugs-1.3.8 findbugs
Buildfile: build.xml
...
findbugs:
[mkdir] Created dir: /tmp/PIG-733/trunk/build/test/findbugs
 [findbugs] Executing findbugs from ant task
 [findbugs] Running FindBugs...
 [findbugs] Warnings generated: 665
 [findbugs] Calculating exit code...
 [findbugs] Setting 'bugs found' flag (1)
 [findbugs] Exit code set to: 1
 [findbugs] Java Result: 1
 [findbugs] Output saved to 
/tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.xml
 [xslt] Processing 
/tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.xml to 
/tmp/PIG-733/trunk/build/test/findbugs/pig-findbugs-report.html
 [xslt] Loading stylesheet 
/homes/pradeepk/findbugs-1.3.8/src/xsl/default.xsl

=== FINDBUGS OUTPUT WITHOUT PATCH
[prade...@chargesize:/tmp/svncheckout/trunk]ant 
-Dfindbugs.home=/homes/pradeepk/findbugs-1.3.8 findbugs
Buildfile: build.xml

check-for-findbugs:

...
findbugs:
[mkdir] Created dir: /tmp/svncheckout/trunk/build/test/findbugs
 [findbugs] Executing findbugs from ant task
 [findbugs] Running FindBugs...
 [findbugs] Warnings generated: 665
 [findbugs] Calculating exit code...
 [findbugs] Setting 'bugs found' flag (1)
 [findbugs] Exit code set to: 1
 [findbugs] Java Result: 1
 [findbugs] Output saved to 
/tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.xml
 [xslt] Processing 
/tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.xml to 
/tmp/svncheckout/trunk/build/test/findbugs/pig-findbugs-report.html
 [xslt] Loading stylesheet 
/homes/pradeepk/findbugs-1.3.8/src/xsl/default.xsl



{noformat}

 Order by sampling dumps entire sample to hdfs which causes dfs FileSystem 
 closed error on large input
 ---

 Key: PIG-733
 URL: https://issues.apache.org/jira/browse/PIG-733
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-733.patch


 Order by has a sampling job which samples the input and creates a sorted list 
 of sample items. Currently the number of items sampled is 100 per map task. 
 So if the input is large, resulting in many maps (say 50,000), the sample is 
 big. This sorted sample is stored on dfs. The WeightedRangePartitioner 
 computes quantile boundaries and weighted probabilities for repeating values 
 in each map by reading the samples file from DFS. In queries with many maps 
 (in the order of 50,000) the dfs read of the sample file fails with a 
 "FileSystem closed" error. This seems to point to a dfs issue wherein a big 
 dfs file being read simultaneously by many dfs clients (in this case all 
 maps) causes the clients to be closed. However on the pig side, loading the 
 sample from each map in the final map reduce job and computing the quantile 
 boundaries and weighted probabilities is inefficient. We should do this 
 computation through a FindQuantiles udf in the same map reduce job which 
 produces the sorted samples. This way less data is written to dfs and in the 
 final map reduce job, the WeightedRangePartitioner needs to just load the 
 computed information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-06 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-733:
---

Attachment: PIG-733-v2.patch

 Order by sampling dumps entire sample to hdfs which causes dfs FileSystem 
 closed error on large input
 ---

 Key: PIG-733
 URL: https://issues.apache.org/jira/browse/PIG-733
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-733-v2.patch, PIG-733.patch


 Order by has a sampling job which samples the input and creates a sorted list 
 of sample items. Currently the number of items sampled is 100 per map task. 
 So if the input is large, resulting in many maps (say 50,000), the sample is 
 big. This sorted sample is stored on dfs. The WeightedRangePartitioner 
 computes quantile boundaries and weighted probabilities for repeating values 
 in each map by reading the samples file from DFS. In queries with many maps 
 (in the order of 50,000) the dfs read of the sample file fails with a 
 "FileSystem closed" error. This seems to point to a dfs issue wherein a big 
 dfs file being read simultaneously by many dfs clients (in this case all 
 maps) causes the clients to be closed. However on the pig side, loading the 
 sample from each map in the final map reduce job and computing the quantile 
 boundaries and weighted probabilities is inefficient. We should do this 
 computation through a FindQuantiles udf in the same map reduce job which 
 produces the sorted samples. This way less data is written to dfs and in the 
 final map reduce job, the WeightedRangePartitioner needs to just load the 
 computed information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-06 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696350#action_12696350
 ] 

Pradeep Kamath commented on PIG-627:


+1, patch committed. Thanks for the contribution Gunther!

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1 followed by a 
 map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs FileSystem closed error on large input

2009-04-09 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-733:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed

 Order by sampling dumps entire sample to hdfs which causes dfs FileSystem 
 closed error on large input
 ---

 Key: PIG-733
 URL: https://issues.apache.org/jira/browse/PIG-733
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-733-v2.patch, PIG-733.patch


 Order by has a sampling job which samples the input and creates a sorted list 
 of sample items. Currently the number of items sampled is 100 per map task, 
 so if the input is large, resulting in many maps (say 50,000), the sample is 
 big. This sorted sample is stored on dfs. The WeightedRangePartitioner 
 computes quantile boundaries and weighted probabilities for repeating values 
 in each map by reading the samples file from DFS. In queries with many maps 
 (on the order of 50,000) the dfs read of the sample file fails with a 
 FileSystem closed error. This seems to point to a dfs issue wherein a big 
 dfs file being read simultaneously by many dfs clients (in this case all 
 maps) causes the clients to be closed. However, on the pig side, loading the 
 sample in each map of the final map reduce job and computing the quantile 
 boundaries and weighted probabilities is inefficient. We should do this 
 computation through a FindQuantiles udf in the same map reduce job which 
 produces the sorted samples. This way less data is written to dfs, and in 
 the final map reduce job the WeightedRangePartitioner needs to just load the 
 computed information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-739) Filter in foreach seems to drop records resulting in decreased count of records

2009-04-16 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-739.


Resolution: Duplicate
  Assignee: Pradeep Kamath

This issue has the same root cause as PIG-514 - hence marking this as a duplicate - 
the fix for this issue will also be tracked in PIG-514.

 Filter in foreach seems to drop records resulting in decreased count of 
 records
 ---

 Key: PIG-739
 URL: https://issues.apache.org/jira/browse/PIG-739
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: filter_distinctbug.pig, testdata


 I have a Pig script in which I count the number of distinct records resulting 
 from a filter; this statement is embedded in a foreach. The number of 
 records I get with alias TESTDATA_AGG_2 is 1.
 {code}
 TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, 
 testid:chararray, userid: chararray, sessionid:chararray, value:long, 
 flag:int);
 TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '123080040' and 
 timestamp lt '123080400' and value != 0);
 TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
 TESTDATA_AGG = foreach TESTDATA_GROUP {
 A = filter TESTDATA_FILTERED by (userid eq sessionid);
 C = distinct A.userid;
 generate group as testid, COUNT(TESTDATA_FILTERED) as 
 counttestdata, COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as 
 total_flags;
 }
 TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
 -- count records generated through nested foreach which contains distinct
 TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
 --explain TESTDATA_AGG_2;
 dump TESTDATA_AGG_2;
 --RESULT (1L)
 {code}
 But when I do the counting of records without the filter and distinct in the 
 foreach, I get a different value (20L).
 {code}
 TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, 
 testid:chararray, userid: chararray, sessionid:chararray, value:long, 
 flag:int);
 TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '123080040' and 
 timestamp lt '123080400' and value != 0);
 TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
 -- count records generated through simple foreach
 TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, 
 COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as 
 total_flags;
 TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
 TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
 dump TESTDATA_AGG2_2;
 --RESULT (20L)
 {code}
 Attaching testdata

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-514) COUNT returns no results as a result of two filter statements in FOREACH

2009-04-16 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699801#action_12699801
 ] 

Pradeep Kamath commented on PIG-514:


I am currently working on implementing the above proposal since I have not seen 
any objections. After making the core changes to implement the above proposal, 
I validated that it fixed the issue reported here and also the ones in PIG-739 
and PIG-710. I need to add a few more changes to make the patch complete and 
will supply a patch once done.

 COUNT returns no results as a result of two filter statements in FOREACH
 

 Key: PIG-514
 URL: https://issues.apache.org/jira/browse/PIG-514
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
 Attachments: mystudentfile.txt


 The following piece of sample code, whose FOREACH counts the filtered 
 student records based on record_type == 1 and scores and also on record_type 
 == 0, does not seem to return any results.
 {code}
 mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
 --keep only what we need
 mydata_filtered = FOREACH  mydata GENERATE   record_type,  name,  age,  
 scores ;
 --group
 mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
 myfinaldata = FOREACH mydata_grouped {
  myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
  myfilter2 = FILTER mydata_filtered BY record_type == 0;
  GENERATE FLATTEN(group),
 -- Only this count causes the problem ??
   COUNT(myfilter1) as col2,
   SUM(myfilter2.scores) as col3,
   COUNT(myfilter2) as col4;  };
 --these set of statements confirm that the count on the  filters returns 1
 --mycountdata = FOREACH mydata_grouped
 --{
 --  myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == 
 scores;
 --  GENERATE
 --  COUNT(myfilter1) as colcount;
 --};
 --dump mycountdata;
 dump myfinaldata;
 {code}
 But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it 
 seems to work, with the following results:
 (0,22,45.0,2L)
 (0,24,133.0,6L)
 (0,25,22.0,1L)
 Also, I have tried to verify whether this is an issue with {code} 
 COUNT(myfilter1) as col2, {code} returning zero. It does not seem to be the 
 case.
 If {code} dump mycountdata; {code} is uncommented, it returns:
 (1L)
 (1L)
 I am attaching the tab separated 'mystudentfile.txt' file used in this Pig 
 script. Is this an issue with 2 filters in the FOREACH followed by a COUNT on 
 these filters??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-20 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700925#action_12700925
 ] 

Pradeep Kamath commented on PIG-627:


Reviewed error_handling_0416.patch for additional changes per comment: 
https://issues.apache.org/jira/browse/PIG-627?focusedCommentId=1260page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_1260.
 +1, committed after removing the javadoc-related changes which were already 
committed in the previous commit.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1, followed by 
 a map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-04-21 Thread Pradeep Kamath (JIRA)
Empty complex constants (empty bag, empty tuple and empty map) should be 
supported
--

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Priority: Minor


We should be able to create an empty bag constant using {}, an empty tuple constant 
using (), and an empty map constant using [] within a pig script.
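
For reference, these constants correspond directly to objects Pig's Java data 
API can already build; a minimal sketch of those Java-side equivalents (the 
wrapper class here is just for illustration, not from any patch):
{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class EmptyConstantsSketch {
    public static void main(String[] args) {
        // {} in a script would map to an empty bag
        DataBag emptyBag = BagFactory.getInstance().newDefaultBag();
        // () would map to a zero-field tuple
        Tuple emptyTuple = TupleFactory.getInstance().newTuple(0);
        // [] would map to an empty map
        Map<String, Object> emptyMap = new HashMap<String, Object>();
        System.out.println(emptyBag.size() + " " + emptyTuple.size()
                + " " + emptyMap.size());
    }
}
{code}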

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-514) COUNT returns no results as a result of two filter statements in FOREACH

2009-04-22 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-514.


   Resolution: Fixed
Fix Version/s: 0.3.0
 Hadoop Flags: [Reviewed]

Patch committed with the change in previous comment. 

 COUNT returns no results as a result of two filter statements in FOREACH
 

 Key: PIG-514
 URL: https://issues.apache.org/jira/browse/PIG-514
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: mystudentfile.txt, PIG-514.patch


 For the following piece of sample code in FOREACH which counts the filtered 
 student records based on record_type == 1 and scores and also on record_type 
 == 0 does not seem to return any results.
 {code}
 mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
 --keep only what we need
 mydata_filtered = FOREACH  mydata GENERATE   record_type,  name,  age,  
 scores ;
 --group
 mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
 myfinaldata = FOREACH mydata_grouped {
  myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
  myfilter2 = FILTER mydata_filtered BY record_type == 0;
  GENERATE FLATTEN(group),
 -- Only this count causes the problem ??
   COUNT(myfilter1) as col2,
   SUM(myfilter2.scores) as col3,
   COUNT(myfilter2) as col4;  };
 --these set of statements confirm that the count on the  filters returns 1
 --mycountdata = FOREACH mydata_grouped
 --{
 --  myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == 
 scores;
 --  GENERATE
 --  COUNT(myfilter1) as colcount;
 --};
 --dump mycountdata;
 dump myfinaldata;
 {code}
 But if you uncomment the  {code} COUNT(myfilter1) as col2, {code}, it seems 
 to work with the following results..
 (0,22,45.0,2L)
 (0,24,133.0,6L)
 (0,25,22.0,1L)
 Also I have tried to verify if this is a issue with the {code} 
 COUNT(myfilter1) as col2, {code} returning zero. It does not seem to be the 
 case.
 If {code}  dump mycountdata; {code} is uncommented it returns:
 (1L)
 (1L)
 I am attaching the tab separated 'mystudentfile.txt' file used in this Pig 
 script. Is this an issue with 2 filters in the FOREACH followed by a COUNT on 
 these filters??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702005#action_12702005
 ] 

Pradeep Kamath commented on PIG-627:


All the work till now (phase 1 and phase 2) has now been committed to trunk. A 
tag (pre-multiquery-phase2) was created prior to committing the multi-query work 
since this is a significantly big patch. The tag will serve as a baseline to trace 
down regressions.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: doc-fix.patch, error_handling_0415.patch, 
 error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1, followed by 
 a map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-775) PORelationToExprProject should create a NonSpillableDataBag to create empty bags

2009-04-24 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-775.


Resolution: Fixed

Patch committed.

 PORelationToExprProject should create a NonSpillableDataBag to create empty 
 bags
 

 Key: PIG-775
 URL: https://issues.apache.org/jira/browse/PIG-775
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Priority: Minor
 Fix For: 0.3.0

 Attachments: PIG-775.patch


 PORelationToExprProject currently uses BagFactory.newDefaultBag() to create 
 an empty bag in cases where it has to send an empty bag on EOP - each such 
 empty bag created will be registered with the SpillableMemoryManager as a 
 spillable bag. Since it is an empty bag, it really should not be registered 
 as a spillable bag. For this, NonSpillableDataBag can be used.
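
The change is essentially a one-line substitution; a sketch of the idea 
(assuming the surrounding EOP handling stays as-is):
{code}
import org.apache.pig.data.DataBag;
import org.apache.pig.data.NonSpillableDataBag;

public class EmptyBagSketch {
    // Before: BagFactory.getInstance().newDefaultBag() registers even an empty
    // bag with the SpillableMemoryManager; a NonSpillableDataBag is not tracked.
    static DataBag newEmptyBag() {
        return new NonSpillableDataBag();
    }
}
{code}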

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY

2009-05-07 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707064#action_12707064
 ] 

Pradeep Kamath commented on PIG-802:


PIG-744 is a duplicate - I will be marking that one as such.

Pasting the summary from PIG-744, which has a little more detail:
Currently order by results in multiple map reduce jobs (2 or 3 depending on the 
script), of which the last one does the actual ordering. In this last map reduce 
job, we create a bag of values (each value being the entire tuple that is 
being sorted) for each sort key(s) using POPackage in the reduce phase. Then 
we turn around and flatten the bag in the foreach following the package, so 
there is really no need for the bag. However, to stay generic and reuse the 
existing operators, we can be more efficient by tagging the POPackage to create 
bags which are backed by the Hadoop iterator itself. This way we do not create 
a bag by copying each tuple from the Hadoop iterator. This should help both 
performance and scalability by making better use of memory.

 PERFORMANCE: not creating bags for ORDER BY
 ---

 Key: PIG-802
 URL: https://issues.apache.org/jira/browse/PIG-802
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich

 Order by should be changed to not use POPackage to put all of the tuples in a 
 bag on the reduce side, as the bag is just immediately flattened. It can 
 instead work like join does for the last input in the join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-12 Thread Pradeep Kamath (JIRA)
PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop 
values iterator)


 Key: PIG-807
 URL: https://issues.apache.org/jira/browse/PIG-807
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


Currently all bags resulting from a group or cogroup are materialized as bags 
containing all of the contents. The issue with this is that if a particular key 
has many corresponding values, all these values get stuffed into a bag which may 
run out of memory and hence spill, causing a slowdown in performance and 
sometimes memory exceptions. In many cases, the udfs which use these bags coming 
out of a group or cogroup only need to iterate over the bag in a unidirectional, 
read-once manner. This can be implemented by having the bag implement its 
iterator by simply iterating over the underlying Hadoop iterator provided in 
the reduce. This kind of bag is also needed in 
http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for 
this issue too. The other part of this issue is to have some way for the udfs 
to communicate to Pig that any input bags they need are read-once bags. 
This can be achieved by having an interface - say UsesReadOnceBags - which 
serves as a tag to indicate the intent to Pig. Pig can then rewire its 
execution plan to use ReadOnceBags where feasible.
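
A sketch of what such a tagging interface and a udf using it might look like. 
The interface name comes from the proposal above; the udf itself and everything 
else here are assumptions for illustration:
{code}
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Marker interface as proposed above: no methods, just a tag Pig could
// inspect when deciding whether to hand the udf a read-once bag.
interface UsesReadOnceBags {}

// A COUNT-style udf that only walks its input bag once, so a read-once,
// Hadoop-iterator-backed bag would be sufficient for it.
public class StreamingCount extends EvalFunc<Long> implements UsesReadOnceBags {
    @Override
    public Long exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        long count = 0;
        Iterator<Tuple> it = bag.iterator();
        while (it.hasNext()) { // single forward pass; no rewinding
            it.next();
            count++;
        }
        return count;
    }
}
{code}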

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY

2009-05-12 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708551#action_12708551
 ] 

Pradeep Kamath commented on PIG-802:


Adding some more details:
A new kind of bag - ReadOnceBag - needs to be implemented. This bag will have a 
reference to the key currently being processed and to the iterator over values 
provided by Hadoop in reduce(). The ReadOnceBag's iterator will simply iterate 
over the Hadoop iterator at each call and construct a tuple using the key 
and value (see POPackage.java for details on how this is done). POPackage 
should also be changed, or a new class introduced, which creates ReadOnceBags 
instead of regular bags. The creation of the bag should only initialize the 
bag with the key and iterator.
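
A minimal sketch of that iterator wiring in Java (simplified: the real code 
would go through POPackage-style value reconstruction; names here are 
illustrative, not the committed implementation):
{code}
import java.util.Iterator;

import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch of a read-once bag's core: it owns the current key and Hadoop's
// values iterator, and builds each tuple lazily instead of copying the
// values into a materialized bag.
public class ReadOnceIteratorSketch implements Iterator<Tuple> {
    private final Object key;              // key for the current reduce() call
    private final Iterator<Object> values; // Hadoop's value iterator

    public ReadOnceIteratorSketch(Object key, Iterator<Object> values) {
        this.key = key;
        this.values = values;
    }

    public boolean hasNext() {
        return values.hasNext();
    }

    public Tuple next() {
        // Reconstruct (key, value) on demand; nothing is buffered, so the
        // bag can only be traversed once.
        Tuple t = TupleFactory.getInstance().newTuple(2);
        try {
            t.set(0, key);
            t.set(1, values.next());
        } catch (ExecException e) {
            throw new RuntimeException(e);
        }
        return t;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
{code}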

 PERFORMANCE: not creating bags for ORDER BY
 ---

 Key: PIG-802
 URL: https://issues.apache.org/jira/browse/PIG-802
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich

 Order by should be changed to not use POPackage to put all of the tuples in a 
 bag on the reduce side, as the bag is just immediately flattened. It can 
 instead work like join does for the last input in the join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-804) problem with lineage with double map redirection

2009-05-13 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-804:
---

Fix Version/s: 0.3.0
Affects Version/s: 0.2.1
   Status: Patch Available  (was: Open)

 problem with lineage with double map redirection
 

 Key: PIG-804
 URL: https://issues.apache.org/jira/browse/PIG-804
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: 0.3.0


 v1   = load 'data' as (s,m,l);
 v2   = foreach  v1  GENERATE
 s#'src_spaceid' AS vspaceid ;
 v3   = foreach  v2  GENERATE
 (chararray)vspaceid#'foo';
 explain v3;
 The last cast does not have a loader associated with it, and as a result the 
 script fails on the backend with the following error: Received a bytearray 
 from the UDF. Cannot determine how to convert the bytearray to string.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-804) problem with lineage with double map redirection

2009-05-13 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-804:
---

Attachment: PIG-804.patch

The root cause was in the parser: in CastExp(), getFieldSchema() was being 
called on the target operand of the cast to get the alias. This had the side 
effect of setting up lineage information (i.e. the canonical map in the 
operand). This place in the code is apparently too early for setting up lineage 
information, since operators may be added/removed later on due to optimizations. 
It should be done at a later, safe point (this change will be tracked in 
PIG-808). As a fix for now, unsetFieldSchema() is called to unset the lineage 
information.

 problem with lineage with double map redirection
 

 Key: PIG-804
 URL: https://issues.apache.org/jira/browse/PIG-804
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-804.patch


 v1   = load 'data' as (s,m,l);
 v2   = foreach  v1  GENERATE
 s#'src_spaceid' AS vspaceid ;
 v3   = foreach  v2  GENERATE
 (chararray)vspaceid#'foo';
 explain v3;
 The last cast does not have a loader associated with it, and as a result the 
 script fails on the backend with the following error: Received a bytearray 
 from the UDF. Cannot determine how to convert the bytearray to string.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-808) getFieldSchema() in ExpressionOperators also sets up lineage information - this can cause issues if getFieldSchema() is called too early

2009-05-13 Thread Pradeep Kamath (JIRA)
getFieldSchema() in ExpressionOperators also sets up lineage information - this 
can cause issues if getFieldSchema() is called too early


 Key: PIG-808
 URL: https://issues.apache.org/jira/browse/PIG-808
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


See PIG-804 for a use case which exposes this bug. We should probably be 
setting up lineage information outside getFieldSchema(), through a visitor, at a 
point where we know it is safe (just before TypeCheckingVisitor?). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY

2009-05-21 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711769#action_12711769
 ] 

Pradeep Kamath commented on PIG-802:


Review comments:
In MRCompiler, does POPackageLite need to be used in the following too:
{noformat}
if (limit!=-1) {
 POPackage pkg_c = new POPackage(new 
OperatorKey(scope,nig.getNextNodeId(scope)));
...
}
{noformat}

In POPackage, the following declarations:
{noformat}
Iterator<NullableTuple> tupIter; 

Object key; 
{noformat}
should have the protected access specifier, to make explicit the intent that 
they are used in POPackageLite.

In ReadOnceBag.equals() you could also check if the keyInfo maps are equal.

The getValueTuple() in ReadOnceBag has duplicate code from 
POPackage.getValueTuple(). Instead of having the same code in two places, I am 
wondering if you could just construct ReadOnceBag with a POPackageLite instance 
passed in the constructor. Then, if you make the POPackageLite.getValueTuple() 
method public, you can just invoke it from the ReadOnceBag code. This way the 
code remains in one place, as in the sketch below. 
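
Something along these lines, schematically (method and constructor shapes here 
are assumed, not the actual patch):
{code}
// Schematic of the suggested refactoring: ReadOnceBag delegates tuple
// construction to the POPackageLite it was built with, so getValueTuple()
// logic lives in exactly one place.
class ReadOnceBagSketch {
    private final POPackageLiteSketch pkg;

    ReadOnceBagSketch(POPackageLiteSketch pkg) {
        this.pkg = pkg;
    }

    Object nextTuple(Object key, Object value) throws Exception {
        return pkg.getValueTuple(key, value); // single shared implementation
    }
}

class POPackageLiteSketch {
    // made public in the suggestion so ReadOnceBag can call it
    public Object getValueTuple(Object key, Object value) throws Exception {
        // build the tuple from key + value as POPackage does today
        return null; // placeholder body for the sketch
    }
}
{code}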

 PERFORMANCE: not creating bags for ORDER BY
 ---

 Key: PIG-802
 URL: https://issues.apache.org/jira/browse/PIG-802
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: OrderByOptimization.patch


 Order by should be changed to not use POPackage to put all of the tuples in a 
 bag on the reduce side, as the bag is just immediately flattened. It can 
 instead work like join does for the last input in the join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY

2009-05-21 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711811#action_12711811
 ] 

Pradeep Kamath commented on PIG-802:


I think even in the future if ReadOnceBags are used in places other than order 
by, they would need to be used immediately after a POPackageLite. So tying the 
two together is not bad and would reduce code duplication. 

 PERFORMANCE: not creating bags for ORDER BY
 ---

 Key: PIG-802
 URL: https://issues.apache.org/jira/browse/PIG-802
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: OrderByOptimization.patch


 Order by should be changed to not use POPackage to put all of the tuples in a 
 bag on the reduce side, as the bag is just immediately flattened. It can 
 instead work like join does for the last input in the join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-814) Make Binstorage more robust when data contains record markers

2009-05-21 Thread Pradeep Kamath (JIRA)
Make Binstorage more robust when data contains record markers
-

 Key: PIG-814
 URL: https://issues.apache.org/jira/browse/PIG-814
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0


When the input stream for BinStorage is at a position where the data contains 
the record marker sequence, the code incorrectly assumes that it is at the 
beginning of a record (tuple) and calls DataReaderWriter.readDatum() trying to 
read the tuple. The problem is more likely when RandomSampleLoader (used in the 
order by implementation) skips through the input stream for sampling and calls 
BinStorage.getNext(). The code should be more robust in such cases.
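
One defensive pattern, as a sketch only (not the committed fix: the marker 
constant, stream handling, and readDatum stand-in below are all assumptions), 
is to treat a failed read after a marker hit as a false positive and resync at 
the next marker:
{code}
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch: after seeing what looks like the record marker, remember the
// position, attempt the read, and on failure fall back to scanning for
// the next marker instead of failing the whole task.
public class MarkerResyncSketch {
    static final int MARKER = 0x01; // placeholder; BinStorage uses a longer sequence

    public static Object readNextRecord(BufferedInputStream in) throws IOException {
        while (true) {
            int b = in.read();
            if (b == -1) return null;   // end of input
            if (b != MARKER) continue;  // keep scanning for a marker byte
            in.mark(64 * 1024);         // remember position after the marker
            try {
                return readDatum(new DataInputStream(in));
            } catch (IOException e) {
                // false positive: the data merely contained the marker bytes;
                // rewind and keep scanning from just after this marker
                in.reset();
            }
        }
    }

    // stand-in for DataReaderWriter.readDatum(DataInput)
    private static Object readDatum(DataInputStream in) throws IOException {
        return in.readUTF(); // placeholder payload decoding for the sketch
    }
}
{code}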

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-804) problem with lineage with double map redirection

2009-05-26 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-804:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch was committed on May 13, 2009. 

 problem with lineage with double map redirection
 

 Key: PIG-804
 URL: https://issues.apache.org/jira/browse/PIG-804
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Olga Natkovich
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-804.patch


 v1   = load 'data' as (s,m,l);
 v2   = foreach  v1  GENERATE
 s#'src_spaceid' AS vspaceid ;
 v3   = foreach  v2  GENERATE
 (chararray)vspaceid#'foo';
 explain v3;
 The last cast does not have a loader associated with it, and as a result the 
 script fails on the backend with the following error: Received a bytearray 
 from the UDF. Cannot determine how to convert the bytearray to string.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-802) PERFORMANCE: not creating bags for ORDER BY

2009-05-26 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12713181#action_12713181
 ] 

Pradeep Kamath commented on PIG-802:


Changes look good - still have a comment about the change in MRCompiler.java:
In MRCompiler, does POPackageLite need to be used in the following too:

{noformat}
if (limit!=-1) {
 POPackage pkg_c = new POPackage(new 
OperatorKey(scope,nig.getNextNodeId(scope)));
...
}
{noformat}



 PERFORMANCE: not creating bags for ORDER BY
 ---

 Key: PIG-802
 URL: https://issues.apache.org/jira/browse/PIG-802
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: OrderByOptimization.patch


 Order by should be changed to not use POPackage to put all of the tuples in a 
 bag on the reduce side, as the bag is just immediately flattened. It can 
 instead work like join does for the last input in the join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-816) PigStorage() does not accept Unicode characters in its constructor

2009-05-29 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-816:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed.

 PigStorage() does not accept Unicode characters in its constructor 
 --

 Key: PIG-816
 URL: https://issues.apache.org/jira/browse/PIG-816
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
Priority: Critical
 Fix For: 0.3.0

 Attachments: PIG-816.patch, pig_1243043613713.log


 A simple Pig script which uses Unicode characters in the PigStorage() 
 constructor fails with the following error:
 {code}
 studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, 
 age:int,gpa:float);
 X2 = GROUP studenttab by age;
 Y2 = FOREACH X2 GENERATE group, COUNT(studenttab);
 store Y2 into '/user/viraj/y2' using PigStorage('\u0001');
 {code}
 
 ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate 
 exception from backend error: org.apache.hadoop.ipc.RemoteException: 
 java.io.IOException: java.lang.RuntimeException: 
 org.xml.sax.SAXParseException: Character reference #1 is an invalid XML 
 character.
 
 Attaching log file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-796) support conversion from numeric types to chararray

2009-06-01 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715325#action_12715325
 ] 

Pradeep Kamath commented on PIG-796:


A few comments:
- In TestPOCast.java, the variables could be named something like 
opWithInputTypeAsByteArray for the POCast objects, since the intent is not so 
clear with the current names.
- In POCast.java, you can check for realType inside the catch clause rather 
than before trying the cast to ByteArray. This way, if the cast to ByteArray 
always succeeds, we will not incur the overhead of the if (realType == null) 
check (see the sketch below).
- In POCast.java, you can avoid catching ExecException and checking for 
errorCode == 1071. Since the getNext() call in POCast already throws 
ExecException, you can just let ExecExceptions from DataType.toXXX() methods 
bubble out.
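
A sketch of the second point, as a hypothetical, simplified method (the real 
POCast code paths differ; only DataByteArray and DataType.findType() are actual 
Pig APIs here):
{code}
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.DataType;

public class CastSketch {
    private Byte realType; // lazily determined type of the incoming object

    // Fast path assumes the input is a DataByteArray, as in the common case;
    // the realType bookkeeping only runs when that assumption fails.
    public String castToChararray(Object o) {
        try {
            return new String(((DataByteArray) o).get());
        } catch (ClassCastException e) {
            if (realType == null) {
                realType = DataType.findType(o); // pay this cost only on failure
            }
            // fall back to a type-specific conversion based on realType
            return o.toString();
        }
    }
}
{code}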

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-796:
---

Status: Patch Available  (was: Open)

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch, pig-796.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-796:
---

Status: Open  (was: Patch Available)

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch, pig-796.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-796:
---

Status: Patch Available  (was: Open)

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch, pig-796.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-796:
---

Status: Open  (was: Patch Available)

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch, pig-796.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-04 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-796:
---

   Resolution: Fixed
Fix Version/s: 0.3.0
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

Patch committed - thanks for contributing, Ashutosh!

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Fix For: 0.3.0

 Attachments: 796.patch, pig-796.patch, pig-796.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)

2009-06-05 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-835:
---

Attachment: PIG-835.patch

The root cause of the issue is that the current MultiQueryOptimizer checks 
whether the map key is of the same type for the different map plans it merges. 
If they are of different types, it ensures that the type is made tuple for all 
map plans - this implies keys which are not tuples will be wrapped in an extra 
tuple, while keys which are already of Tuple type will be left alone (this is 
ensured in POLocalRearrange). However, the Demux operator, which passes the key 
and bag of values to the merged reduce plan, currently always unwraps the tuple 
whenever the map keys are of different types. This results in unwrapping of 
keys which were originally tuples and should not be unwrapped.

The attached patch fixes this by storing an array of boolean flags in the Demux 
operator to indicate which map keys are wrapped and which are not, so that 
unwrapping occurs only in cases where the original map key was not already a 
tuple and was wrapped.
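
Schematically, the fix makes the unwrap decision per plan rather than global. 
A sketch of that decision (illustrative only; the actual PODemux code differs):
{code}
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Sketch of the per-plan unwrap decision: isKeyWrapped[i] is true only for
// plans whose non-tuple key was wrapped during the merge, so keys that were
// tuples to begin with are passed through untouched.
public class DemuxKeySketch {
    private final boolean[] isKeyWrapped;

    public DemuxKeySketch(boolean[] isKeyWrapped) {
        this.isKeyWrapped = isKeyWrapped;
    }

    public Object extractKey(Object key, int planIndex) throws ExecException {
        if (isKeyWrapped[planIndex]) {
            return ((Tuple) key).get(0); // undo the wrapping we added ourselves
        }
        return key; // key was originally a tuple (or never wrapped): leave it alone
    }
}
{code}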

 Multiquery optimization does not handle the case where the map keys in the 
 split plans have different key types (tuple and non tuple key type)
 --

 Key: PIG-835
 URL: https://issues.apache.org/jira/browse/PIG-835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-835.patch


 A query like the following results in an exception on execution:
 {noformat}
 a = load 'mult.input' as (name, age, gpa);
 b = group a ALL;
 c = foreach b generate group, COUNT(a);
 store c into 'foo';
 d = group a by (name, gpa);
 e = foreach d generate flatten(group), MIN(a.age);
 store e into 'bar';
 {noformat}
 Exception on execution:
 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from 
 attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: 
 java.lang.String cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-838) Parser does not handle ctrl-m ('\u000d') as argument to PigStorage

2009-06-05 Thread Pradeep Kamath (JIRA)
Parser does not handle ctrl-m ('\u000d') as argument to PigStorage
--

 Key: PIG-838
 URL: https://issues.apache.org/jira/browse/PIG-838
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath


A script which has 
a = load 'input' using PigStorage('\u000d');
 
produces the following error:

2009-06-05 14:47:49,241 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Lexical error at line 1, column 47.  Encountered: 
\r (13), after : \'


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)

2009-06-05 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-835:
---

Status: Patch Available  (was: Open)

 Multiquery optimization does not handle the case where the map keys in the 
 split plans have different key types (tuple and non tuple key type)
 --

 Key: PIG-835
 URL: https://issues.apache.org/jira/browse/PIG-835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-835.patch


 A query like the following results in an exception on execution:
 {noformat}
 a = load 'mult.input' as (name, age, gpa);
 b = group a ALL;
 c = foreach b generate group, COUNT(a);
 store c into 'foo';
 d = group a by (name, gpa);
 e = foreach d generate flatten(group), MIN(a.age);
 store e into 'bar';
 {noformat}
 Exception on execution:
 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from 
 attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: 
 java.lang.String cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)

2009-06-08 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-835:
---

Status: Open  (was: Patch Available)

 Multiquery optimization does not handle the case where the map keys in the 
 split plans have different key types (tuple and non tuple key type)
 --

 Key: PIG-835
 URL: https://issues.apache.org/jira/browse/PIG-835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-835.patch


 A query like the following results in an exception on execution:
 {noformat}
 a = load 'mult.input' as (name, age, gpa);
 b = group a ALL;
 c = foreach b generate group, COUNT(a);
 store c into 'foo';
 d = group a by (name, gpa);
 e = foreach d generate flatten(group), MIN(a.age);
 store e into 'bar';
 {noformat}
 Exception on execution:
 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from 
 attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: 
 java.lang.String cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)

2009-06-08 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-835:
---

Attachment: PIG-835-v2.patch

New patch with the findbugs warnings addressed - essentially, findbugs wanted 
the public static members in PigNullableWritable to be marked final.

 Multiquery optimization does not handle the case where the map keys in the 
 split plans have different key types (tuple and non tuple key type)
 --

 Key: PIG-835
 URL: https://issues.apache.org/jira/browse/PIG-835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-835-v2.patch, PIG-835.patch


 A query like the following results in an exception on execution:
 {noformat}
 a = load 'mult.input' as (name, age, gpa);
 b = group a ALL;
 c = foreach b generate group, COUNT(a);
 store c into 'foo';
 d = group a by (name, gpa);
 e = foreach d generate flatten(group), MIN(a.age);
 store e into 'bar';
 {noformat}
 Exception on execution:
 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from 
 attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: 
 java.lang.String cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)

2009-06-08 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-835:
---

Status: Patch Available  (was: Open)

 Multiquery optimization does not handle the case where the map keys in the 
 split plans have different key types (tuple and non tuple key type)
 --

 Key: PIG-835
 URL: https://issues.apache.org/jira/browse/PIG-835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-835-v2.patch, PIG-835.patch


 A query like the following results in an exception on execution:
 {noformat}
 a = load 'mult.input' as (name, age, gpa);
 b = group a ALL;
 c = foreach b generate group, COUNT(a);
 store c into 'foo';
 d = group a by (name, gpa);
 e = foreach d generate flatten(group), MIN(a.age);
 store e into 'bar';
 {noformat}
 Exception on execution:
 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from 
 attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: 
 java.lang.String cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-841) PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort

2009-06-09 Thread Pradeep Kamath (JIRA)
PERFORMANCE: The sample MR job in order by implementation can use Hadoop 
sorting instead of doing a POSort
--

 Key: PIG-841
 URL: https://issues.apache.org/jira/browse/PIG-841
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


Currently the sample map reduce job in the order by implementation does the 
following:
 - sample 100 records from each map
 - group all on the above output
 - sort the output bag from the above grouping on the keys of the order by
 - give the sorted bag to the FindQuantiles udf

Steps 2 and 3 above can be replaced by:
- group the sample output by the order by key and set the parallelism of the 
group to 1 so that the output of the group goes to one reducer. Since Hadoop 
ensures the output of the group is sorted by key, we get the sorting for free 
without using POSort.
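
The idea leans on MapReduce's shuffle guarantee that each reducer sees its keys 
in sorted order. Schematically, in raw old-API MapReduce terms (a sketch only; 
the real implementation sits inside Pig's plan, not a hand-written reducer):
{code}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// With the sample job's parallelism set to 1 (job.setNumReduceTasks(1)),
// this single reducer sees the sampled keys already sorted by the shuffle,
// so a FindQuantiles-style computation can consume a sorted stream with no
// POSort step.
public class SortedSampleReducerSketch extends MapReduceBase
        implements Reducer<Text, Text, Text, NullWritable> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, NullWritable> out, Reporter reporter)
            throws IOException {
        // Keys arrive here in sorted order; emit them (or pick quantile
        // boundaries while streaming) without any explicit sort.
        out.collect(key, NullWritable.get());
    }
}
{code}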

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-841) PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort

2009-06-09 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717835#action_12717835
 ] 

Pradeep Kamath commented on PIG-841:


This mechanism can be used for any join which requires sampling, like the one 
described in http://wiki.apache.org/pig/PigSkewedJoinSpec

 PERFORMANCE: The sample MR job in order by implementation can use Hadoop 
 sorting instead of doing a POSort
 --

 Key: PIG-841
 URL: https://issues.apache.org/jira/browse/PIG-841
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


 Currently the sample map reduce job in the order by implementation does the 
 following:
  - sample 100 records from each map
  - group all on the above output
  - sort the output bag from the above grouping on the keys of the order by
  - give the sorted bag to the FindQuantiles udf
 Steps 2 and 3 above can be replaced by:
 - group the sample output by the order by key and set the parallelism of the 
 group to 1 so that the output of the group goes to one reducer. Since Hadoop 
 ensures the output of the group is sorted by key, we get the sorting for free 
 without using POSort. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)

2009-06-09 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-835:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed to both trunk and branch-0.3

 Multiquery optimization does not handle the case where the map keys in the 
 split plans have different key types (tuple and non tuple key type)
 --

 Key: PIG-835
 URL: https://issues.apache.org/jira/browse/PIG-835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-835-v2.patch, PIG-835.patch


 A query like the following, where the map key for the first group (the ALL 
 key) is a non-tuple while the key for the second group, (name, gpa), is a 
 tuple, results in an exception on execution:
 {noformat}
 a = load 'mult.input' as (name, age, gpa);
 b = group a ALL;
 c = foreach b generate group, COUNT(a);
 store c into 'foo';
 d = group a by (name, gpa);
 e = foreach d generate flatten(group), MIN(a.age);
 store e into 'bar';
 {noformat}
 Exception on execution:
 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from 
 attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: 
 java.lang.String cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan

2009-06-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-846:
---

Attachment: PIG-846.patch

 MultiQuery optimization in some cases has an issue when there is a split in 
 the map plan 
 -

 Key: PIG-846
 URL: https://issues.apache.org/jira/browse/PIG-846
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-846.patch


 The following script produces the error that follows:
 {noformat}
 A = LOAD 'input.txt' as (f0, f1, f2, f3, f4, f5, f6, f7, f8); 
 B = FOREACH A GENERATE f0, f1, f2, f3, f4;
 B1 = foreach B generate f0, f1, f2;
 C = GROUP B1 BY (f1, f2);
 STORE C into 'foo1';
 B2 = FOREACH B GENERATE f0, f3, f4;
 E = GROUP B2 BY (f3, f4);
 STORE E into 'foo2';
 F = FOREACH A GENERATE f0, f5, f6, f7, f8;
 F1 = FOREACH F GENERATE f0, f5, f6;
 G = GROUP F1 BY (f5, f6);
 STORE G into 'foo3';
 F2 = FOREACH F GENERATE f0, f7, f8;
 I = GROUP F2 BY (f7, f8);
 STORE I into 'foo4';
 {noformat}
 Exception encountered during execution:
 {noformat}
 java.lang.NullPointerException
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:262)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:209)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:277)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-847) Setting twoLevelAccessRequired field in a bag schema should not be required to access fields in the tuples of the bag

2009-06-12 Thread Pradeep Kamath (JIRA)
Setting twoLevelAccessRequired field in a bag schema should not be required to 
access fields in the tuples of the bag
-

 Key: PIG-847
 URL: https://issues.apache.org/jira/browse/PIG-847
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath


Currently Pig interprets the result type of a relation as a bag, but the 
schema of the relation directly contains the schema describing the fields in 
the relation's tuples. However, when a udf wants to return a bag, when there 
is a bag in the input data, or when the user creates a bag constant, the 
schema of the bag has a single field schema which is that of the tuple, and 
the tuple's schema has the types of the fields. To access the fields of such 
a bag directly, using something like bagname.fieldname or 
bagname.fieldposition, the schema of the bag must have twoLevelAccess set to 
true so that Pig's type system can traverse the tuple schema and get to the 
field in question. This is confusing - we should try to see whether we can 
avoid needing this extra flag. One possible solution is to treat bags the 
same way whether they represent relations or real bags. Another is to 
introduce a special relation datatype for the result type of a relation, so 
that the bag type is used only for true bags; in this case, a bag schema 
would always need to contain a tuple schema describing the fields.
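
As a hypothetical Pig Latin illustration of the access pattern in question 
(the file, alias, and field names are made up):
{noformat}
A = LOAD 'students' AS (name: chararray, scores: bag{t: (marks: int)});
-- Project the 'marks' field out of the tuples inside the 'scores' bag;
-- today this resolves only when the bag's schema has twoLevelAccess set.
B = FOREACH A GENERATE name, scores.marks;
{noformat}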

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-848) Explain output sometimes may not match the exact plan that is executed in terms of the order in which inner plans and operators are presented - (semantically the plans are the same)

2009-06-12 Thread Pradeep Kamath (JIRA)
Explain output sometimes may not match the exact plan that is executed in terms 
of the order in which inner plans and operators are presented - (semantically 
the plans are the same)
-

 Key: PIG-848
 URL: https://issues.apache.org/jira/browse/PIG-848
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath


The visitors used for explain and in the MRCompiler do not guarantee order - 
hence the plan shown in explain output may not match the plan that is finally 
executed. This is not a bug, but it makes debugging harder. Even if the plan 
that is executed differs from the one shown in explain, it is still the same 
in terms of semantics - the difference is only in the order of inner plans 
and operators. It would be nice to have an order-preserving way of producing 
explain output which would also be used to construct the plan (MRPlan) that 
is finally executed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan

2009-06-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-846:
---

Status: Open  (was: Patch Available)

I will be resubmitting a new patch - I just realized that a few unit tests 
are broken.

 MultiQuery optimization in some cases has an issue when there is a split in 
 the map plan 
 -

 Key: PIG-846
 URL: https://issues.apache.org/jira/browse/PIG-846
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-846.patch


 The following script produces the error that follows:
 {noformat}
 A = LOAD 'input.txt' as (f0, f1, f2, f3, f4, f5, f6, f7, f8); 
 B = FOREACH A GENERATE f0, f1, f2, f3, f4;
 B1 = foreach B generate f0, f1, f2;
 C = GROUP B1 BY (f1, f2);
 STORE C into 'foo1';
 B2 = FOREACH B GENERATE f0, f3, f4;
 E = GROUP B2 BY (f3, f4);
 STORE E into 'foo2';
 F = FOREACH A GENERATE f0, f5, f6, f7, f8;
 F1 = FOREACH F GENERATE f0, f5, f6;
 G = GROUP F1 BY (f5, f6);
 STORE G into 'foo3';
 F2 = FOREACH F GENERATE f0, f7, f8;
 I = GROUP F2 BY (f7, f8);
 STORE I into 'foo4';
 {noformat}
 Exception encountered during execution:
 {noformat}
 java.lang.NullPointerException
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:262)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:209)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:277)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan

2009-06-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-846:
---

Attachment: PIG-846-v2.patch

New patch - the only change is to not add extra information in 
POLocalRearrange.name(). That extra information was in the earlier patch only 
to make explain output more informative, but it breaks some unit tests.

The TestHBaseStorage unit test still fails for me, but the failure is not 
related to the changes in the patch - I am assuming it is an environment 
issue on my machine.

 MultiQuery optimization in some cases has an issue when there is a split in 
 the map plan 
 -

 Key: PIG-846
 URL: https://issues.apache.org/jira/browse/PIG-846
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-846-v2.patch, PIG-846.patch


 The following script produces the error that follows:
 {noformat}
 A = LOAD 'input.txt' as (f0, f1, f2, f3, f4, f5, f6, f7, f8); 
 B = FOREACH A GENERATE f0, f1, f2, f3, f4;
 B1 = foreach B generate f0, f1, f2;
 C = GROUP B1 BY (f1, f2);
 STORE C into 'foo1';
 B2 = FOREACH B GENERATE f0, f3, f4;
 E = GROUP B2 BY (f3, f4);
 STORE E into 'foo2';
 F = FOREACH A GENERATE f0, f5, f6, f7, f8;
 F1 = FOREACH F GENERATE f0, f5, f6;
 G = GROUP F1 BY (f5, f6);
 STORE G into 'foo3';
 F2 = FOREACH F GENERATE f0, f7, f8;
 I = GROUP F2 BY (f7, f8);
 STORE I into 'foo4';
 {noformat}
 Exception encountered during execution:
 {noformat}
 java.lang.NullPointerException
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:262)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:209)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:277)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-835) Multiquery optimization does not handle the case where the map keys in the split plans have different key types (tuple and non tuple key type)

2009-06-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-835:
---

Attachment: PIG-846-v2.patch

New patch - the only change is to not add extra information in 
POLocalRearrange.name(). That extra information was in the earlier patch only 
to make explain output more informative, but it breaks some unit tests.

The TestHBaseStorage unit test still fails for me, but the failure is not 
related to the changes in the patch - I am assuming it is an environment 
issue on my machine.

 Multiquery optimization does not handle the case where the map keys in the 
 split plans have different key types (tuple and non tuple key type)
 --

 Key: PIG-835
 URL: https://issues.apache.org/jira/browse/PIG-835
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-835-v2.patch, PIG-835.patch, PIG-846-v2.patch


 A query like the following, where the map key for the first group (the ALL 
 key) is a non-tuple while the key for the second group, (name, gpa), is a 
 tuple, results in an exception on execution:
 {noformat}
 a = load 'mult.input' as (name, age, gpa);
 b = group a ALL;
 c = foreach b generate group, COUNT(a);
 store c into 'foo';
 d = group a by (name, gpa);
 e = foreach d generate flatten(group), MIN(a.age);
 store e into 'bar';
 {noformat}
 Exception on execution:
 09/06/04 16:56:11 INFO mapred.TaskInProgress: Error from 
 attempt_200906041655_0001_r_00_3: java.lang.ClassCastException: 
 java.lang.String cannot be cast to org.apache.pig.data.Tuple
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:312)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:248)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:238)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:320)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:288)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-846) MultiQuery optimization in some cases has an issue when there is a split in the map plan

2009-06-12 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-846:
---

Status: Patch Available  (was: Open)

 MultiQuery optimization in some cases has an issue when there is a split in 
 the map plan 
 -

 Key: PIG-846
 URL: https://issues.apache.org/jira/browse/PIG-846
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.3.0

 Attachments: PIG-846-v2.patch, PIG-846.patch


 The following script produces the error that follows:
 {noformat}
 A = LOAD 'input.txt' as (f0, f1, f2, f3, f4, f5, f6, f7, f8); 
 B = FOREACH A GENERATE f0, f1, f2, f3, f4;
 B1 = foreach B generate f0, f1, f2;
 C = GROUP B1 BY (f1, f2);
 STORE C into 'foo1';
 B2 = FOREACH B GENERATE f0, f3, f4;
 E = GROUP B2 BY (f3, f4);
 STORE E into 'foo2';
 F = FOREACH A GENERATE f0, f5, f6, f7, f8;
 F1 = FOREACH F GENERATE f0, f5, f6;
 G = GROUP F1 BY (f5, f6);
 STORE G into 'foo3';
 F2 = FOREACH F GENERATE f0, f7, f8;
 I = GROUP F2 BY (f7, f8);
 STORE I into 'foo4';
 {noformat}
 Exception encountered during execution:
 {noformat}
 java.lang.NullPointerException
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:262)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:209)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:186)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:277)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:268)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:142)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


