[ https://issues.apache.org/jira/browse/PIG-665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673079#action_12673079 ]
Olga Natkovich commented on PIG-665:
------------------------------------

+1; please, commit

> Map key type not correctly set (for use when key is null) when map plan does not have localrearrange
> ----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-665
>                 URL: https://issues.apache.org/jira/browse/PIG-665
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-665.patch
>
>
> KeyTypeDiscoveryVisitor visits the map plan to figure out the datatype of the
> map key. This is required so that when the map key is null, we can still
> construct a valid NullableXXXWritable object to pass on to hadoop in the
> collect() call (hadoop needs a valid object even for null objects). Currently
> the KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange to
> figure out the key type. In a pig script which results in multiple Map reduce
> jobs, one of the jobs could have a map plan with only POLoads in it. In such
> a case, the map key type is not discovered and this results in a null being
> returned from HDataType.getWritableComparableTypes() method. This in turn
> will result in a NullPointerException in the collect().
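
The failure mode described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not Pig's code: the class `KeyTypeSketch`, the `Op` model, the type codes, and both method names are hypothetical stand-ins for the visitor and for the lookup that returns null when the key type was never discovered.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of the problem; none of these names are Pig's real API. */
public class KeyTypeSketch {
    // Hypothetical type codes, chosen only for this sketch.
    static final byte UNKNOWN = 0;
    static final byte INTEGER = 10;

    /** Minimal operator model: an operator name plus the key type it carries, if any. */
    static final class Op {
        final String name;
        final byte keyType;

        Op(String name, byte keyType) {
            this.name = name;
            this.keyType = keyType;
        }
    }

    /** Mimics a visitor that only inspects POLocalRearrange and POPackage operators. */
    static byte discoverKeyType(List<Op> mapPlan) {
        for (Op op : mapPlan) {
            if (op.name.equals("POLocalRearrange") || op.name.equals("POPackage")) {
                return op.keyType;
            }
        }
        // A map plan that contains only POLoad operators falls through to here.
        return UNKNOWN;
    }

    /** Returns a placeholder for the nullable writable, or null when the type is unknown. */
    static Object nullableWritableFor(byte keyType) {
        switch (keyType) {
            case INTEGER:
                return "NullableIntWritable";
            default:
                return null; // the caller would hit a NullPointerException on first use
        }
    }

    public static void main(String[] args) {
        List<Op> loadsOnly = Arrays.asList(new Op("POLoad", UNKNOWN), new Op("POLoad", UNKNOWN));
        byte keyType = discoverKeyType(loadsOnly);
        System.out.println("key type = " + keyType + ", writable = " + nullableWritableFor(keyType));
    }
}
```

In this sketch a plan containing a `POLocalRearrange` yields a usable key type, while a plan with only `POLoad` operators yields the unknown code and therefore a null writable, which is the situation the report describes.
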
> Here is a script which can prompt this behavior:
> {code}
> a = load 'a.txt' as (x:int, y:int, z:int);
> b = load 'b.txt' as (x:int, y:int);
> b_group = group b by x;
> b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
> a_group = group a by (x, y);
> a_aggs = foreach a_group {
>     generate
>         flatten(group) as (x, y),
>         SUM(a.z) as zs;
> };
> join_a_b = join b_sum by x, a_aggs by x; --> the map plan for this join will
> -- only have two POLoads which will result in the NullPointerException at
> -- runtime in collect()
> dump join_a_b;
> {code}
> Contents of a.txt (columns are tab separated):
> The first column of the first two rows is null (represented by an empty
> column)
> {noformat}
> 	7	8
> 	8	9
> 1	20	30
> 1	20	40
> {noformat}
> Contents of b.txt (columns are tab separated):
> {noformat}
> 7	2
> 1	5
> 1	10
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.