Failure in Hadoop map collect stage due to type mismatch in the keys used in 
cogroup
------------------------------------------------------------------------------------

                 Key: PIG-537
                 URL: https://issues.apache.org/jira/browse/PIG-537
             Project: Pig
          Issue Type: Bug
    Affects Versions: types_branch
            Reporter: Viraj Bhat
            Priority: Critical
             Fix For: types_branch


Consider the following Pig query, which exposes problems both during Logical 
Plan creation and during the subsequent execution of the M/R job. The query 
performs two cogroups. The first cogroups A and B by username, and its result is 
used to generate the alias ABtemptable. The second cogroups A with ABtemptable 
on marks (newmarks in ABtemptable), which was read in as an int. 
==================================================================================
{code}
A = load 'mymarks.txt' as (username:chararray,marks:int);
B = load 'mygrades.txt' as (username:chararray,grade:chararray);
ABtemp = cogroup A by username, B  by username;
ABtemptable = foreach ABtemp generate
           group as username,
           flatten(A.marks) as newmarks;
--describe ABtemptable;
C = cogroup A by marks, ABtemptable by newmarks;
--describe C;
explain C;
dump C;
{code}
==================================================================================
The schemas that Pig reports for ABtemptable and C:
==================================================================================
{code}describe ABtemptable{code} = ABtemptable: {username: chararray,newmarks: 
int}
{code}describe C{code}  = C: {group: int,A: {username: chararray,marks: 
int},ABtemptable: {username: chararray,newmarks: int}}
==================================================================================
If you run the above query you get the following error:
==================================================================================
2008-11-18 03:57:14,372 [main] ERROR 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error 
message from task (map) task_200810152105_0156_m_000000java.io.IOException: 
Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, 
recieved org.apache.pig.impl.io.NullableIntWritable
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:97)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:172)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:158)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:82)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
==================================================================================
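For context, the exception is raised by the map output collector, which appears to require every emitted key to be of the single key class declared for the job. The Python sketch below models that check. All class and function names are illustrative stand-ins rather than the Hadoop or Pig API, the sample records are made up, and the assumption that the job's key class was derived from the chararray-typed newmarks is only one plausible reading of the trace.

```python
class KeyTypeMismatch(Exception):
    """Raised when an emitted key is not of the declared key class."""


class NullableText:
    """Stand-in for org.apache.pig.impl.io.NullableText (chararray key)."""
    def __init__(self, value):
        self.value = value


class NullableIntWritable:
    """Stand-in for org.apache.pig.impl.io.NullableIntWritable (int key)."""
    def __init__(self, value):
        self.value = value


class MapOutputBuffer:
    """Hypothetical model of a collector with one declared key class per job."""

    def __init__(self, key_class):
        self.key_class = key_class
        self.records = []

    def collect(self, key, value):
        # The key class is fixed for the whole job, so an exact match is required.
        if type(key) is not self.key_class:
            raise KeyTypeMismatch(
                "Type mismatch in key from map: expected %s, received %s"
                % (self.key_class.__name__, type(key).__name__)
            )
        self.records.append((key, value))


# Assumption: the plan typed the cogroup key as chararray, so the job expects
# NullableText, while input A still emits its int key.
buffer = MapOutputBuffer(NullableText)
buffer.collect(NullableText("95"), ("u1", "95"))  # illustrative record
try:
    buffer.collect(NullableIntWritable(95), ("u1", 95))  # illustrative record
    error = None
except KeyTypeMismatch as exc:
    error = str(exc)
print(error)
```

The first record is accepted and the second is rejected, which matches the shape of the failure above: one cogroup input produces keys of the expected class and the other does not.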
In the {code}explain C{code} output, however, newmarks has become a 
chararray, even though describe reports it as an int:
==================================================================================
|---CoGroup viraj-Tue Nov 18 03:49:42 UTC 2008-25 Schema: {group: 
Unknown,{username: bytearray,marks: int},ABtemptable: {username: 
chararray,newmarks: chararray}} Type: bag
    |   |
    |   Project viraj-Tue Nov 18 03:49:42 UTC 2008-23 Projections: [1] 
Overloaded: false FieldSchema: marks: int Type: int
    |   Input: SplitOutput[null] viraj-Tue Nov 18 03:49:42 UTC 2008-29
    |   |
    |   Project viraj-Tue Nov 18 03:49:42 UTC 2008-24 Projections: [1] 
Overloaded: false FieldSchema: newmarks: chararray Type: chararray
    |   Input: ForEach viraj-Tue Nov 18 03:49:42 UTC 2008-22
    |---ForEach viraj-Tue Nov 18 03:49:42 UTC 2008-22 Schema: {username: 
chararray,newmarks: chararray} Type: bag
==================================================================================
In summary, this script demonstrates the following problems:
1) Logical Plan creation assigns the wrong type to newmarks (chararray instead 
of int).
2) Cogrouping on fields of different types, which results in a group key of 
type Unknown, is not caught during the compile phase.
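The kind of compile-time check that problem 2 asks for could look roughly like the following Python sketch. It is not Pig's type checker: the function name and the alias-to-type mapping are hypothetical, and a real implementation might insert an implicit cast instead of rejecting the plan.

```python
def cogroup_key_type(inputs):
    """Return the common key type of all cogroup inputs, or raise if they differ.

    `inputs` maps an alias name to the declared type of its group-by key.
    """
    types = set(inputs.values())
    if len(types) != 1:
        detail = ", ".join(
            "%s: %s" % (alias, key_type) for alias, key_type in sorted(inputs.items())
        )
        raise TypeError("cogroup keys have different types (%s)" % detail)
    return types.pop()


# Types as reported by describe: both keys are int, so the check passes.
print(cogroup_key_type({"A": "int", "ABtemptable": "int"}))

# Types as shown by explain: newmarks is chararray, so the check fails early
# instead of letting the job fail in the map collect stage.
try:
    cogroup_key_type({"A": "int", "ABtemptable": "chararray"})
    message = None
except TypeError as exc:
    message = str(exc)
print(message)
```

Failing at this point would surface the mismatch before any M/R job is launched.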
I am also attaching to this JIRA the explain output for alias C and the test 
files needed to run the script.
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
