[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1437:

Assignee: Xuefu Zhang
Fix Version/s: 0.9.0

[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

Key: PIG-1437
URL: https://issues.apache.org/jira/browse/PIG-1437
Project: Pig
Issue Type: Sub-task
Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Xuefu Zhang
Priority: Minor
Fix For: 0.9.0

It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}
to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}
This can only be done if no columns within the bags are referenced subsequently in the script. Since DISTINCT is executed more efficiently than group-by in the Pig-Hadoop world, this will be a huge win.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1437:

Parent: PIG-1319
Issue Type: Sub-task (was: Bug)

[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

Key: PIG-1437
URL: https://issues.apache.org/jira/browse/PIG-1437
Project: Pig
Issue Type: Sub-task
Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor

It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}
to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}
This can only be done if no columns within the bags are referenced subsequently in the script. Since DISTINCT is executed more efficiently than group-by in the Pig-Hadoop world, this will be a huge win.
[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1437:

Release Note: (was: It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}
to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}
This can only be done if no columns within the bags are referenced subsequently in the script. Since DISTINCT is executed more efficiently than group-by in the Pig-Hadoop world, this will be a huge win.
)

Description: It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}
to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}
This can only be done if no columns within the bags are referenced subsequently in the script. Since DISTINCT is executed more efficiently than group-by in the Pig-Hadoop world, this will be a huge win.

[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

Key: PIG-1437
URL: https://issues.apache.org/jira/browse/PIG-1437
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor

It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}
to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}
This can only be done if no columns within the bags are referenced subsequently in the script. Since DISTINCT is executed more efficiently than group-by in the Pig-Hadoop world, this will be a huge win.
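
The equivalence described in the issue, and the condition under which it stops holding, can be illustrated with a small Python model. This is only a sketch for intuition and not Pig code or the optimizer's implementation: the function names and sample rows are hypothetical, relations are modeled as lists of tuples, and results are compared as sets because Pig relations are unordered. It also assumes, as in the issue's examples, that the grouping key covers every column of the relation.

```python
from collections import defaultdict


def group_then_flatten(rows):
    """Model of: B = group A by (name, age); C = foreach B generate flatten(group);"""
    bags = defaultdict(list)
    for row in rows:
        # The grouping key is the whole tuple; the bag collects the matching rows.
        bags[row].append(row)
    # Only the group key is projected, so the bags are never read.
    return set(bags.keys())


def distinct(rows):
    """Model of: B = distinct A;"""
    return set(rows)


def group_then_count(rows):
    """Hypothetical counter-example: the bag is referenced (here via its size),
    so the result carries information that DISTINCT cannot reproduce."""
    bags = defaultdict(list)
    for row in rows:
        bags[row].append(row)
    return {key + (len(bag),) for key, bag in bags.items()}


if __name__ == "__main__":
    sample = [("alice", 30), ("bob", 25), ("alice", 30)]
    print(group_then_flatten(sample) == distinct(sample))
    print(group_then_count(sample))
```

When the bag is only built and then discarded, both paths yield the same set of tuples, which is why the rewrite is safe in that case. Once anything inside the bag is used later in the script, as in `group_then_count`, the grouping cannot be replaced.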