[ 
https://issues.apache.org/jira/browse/PIG-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1169:
---------------------------------

    Assignee: Richard Ding

> Problems with some top N queries
> --------------------------------
>
>                 Key: PIG-1169
>                 URL: https://issues.apache.org/jira/browse/PIG-1169
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>
> Recently, a couple of problems related to the Top N queries were reported by 
> users.
> * From Chuang Liu:
> We tried to get top N results after a groupby and sort, and got different 
> results with or without storing the full sorted results. Here is a skeleton 
> of our pig script.
> {code}
> raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
> grouped = group raw_data by (f1, f2);
> data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
> ordered = order data by value DESC parallel 10;
> topn = limit ordered 10;
> store ordered into 'outputdir/full';
> store topn into 'outputdir/topn';
> {code}
> With the statement 'store ordered ...', top N results are incorrect, but 
> without the statement, results are correct. Has anyone seen this before? I 
> know a similar bug has been fixed in the multi-query release. We are on Pig 
> 0.4 and Hadoop 0.20.1.
> * From Corry Haines:
> I am not sure if this is a bug, or something more subtle, but here is the 
> problem that I am having.
> When I LOAD a dataset, change it with an ORDER, LIMIT it, then CROSS it with 
> itself, the results are not correct. I expect to see the cross of the 
> limited, ordered dataset, but instead I see the cross of the limited dataset. 
> Effectively, it's like the LIMIT is being excluded.
> Pig Version: 0.5.0
> Hadoop Version: 0.20.1
> I would greatly appreciate some help, as this is somewhat frustrating.
> Example code (and output) follows:
> {code}
> A = load 'foo' as (f1:int, f2:int, f3:int);
> B = load 'foo' as (f1:int, f2:int, f3:int);
> a = ORDER A BY f1 DESC;
> b = ORDER B BY f1 DESC;
> aa = LIMIT a 1;
> bb = LIMIT b 1;
> C = CROSS aa, bb;
> DUMP C;
> {code}
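
For reference, the results both reporters expect can be modelled outside Pig. The following is only a minimal Python sketch of the intended semantics, not Pig's implementation; the function names and the tuple layout are hypothetical and chosen for illustration. It assumes the top N rows are always the first N rows of the fully ordered relation, whether or not that relation is also stored, and that a CROSS of two limited relations contains only the limited rows.

```python
from collections import defaultdict
from itertools import product


def top_n_by_group_sum(rows, n):
    """Model of: group by (f1, f2), SUM(fk), order by value DESC, limit n.

    rows is an iterable of (f1, f2, fk) tuples (hypothetical layout).
    Returns (ordered, topn); topn must equal ordered[:n] regardless of
    whether 'ordered' is also written out.
    """
    totals = defaultdict(int)
    for f1, f2, fk in rows:
        totals[(f1, f2)] += fk
    # Python's sort is stable; Pig makes no such promise for ties.
    ordered = sorted(
        ((f1, f2, value) for (f1, f2), value in totals.items()),
        key=lambda r: r[2],
        reverse=True,
    )
    return ordered, ordered[:n]


def cross_of_top_rows(rows, n=1):
    """Model of: order by f1 DESC, limit n, then cross the result with itself."""
    top = sorted(rows, key=lambda r: r[0], reverse=True)[:n]
    # Each output tuple is the concatenation of one row from each side.
    return [a + b for a, b in product(top, top)]
```

With n = 1, `cross_of_top_rows` returns exactly one tuple, built from the row with the largest f1 on both sides. A larger result, or a result built from a different row, matches the symptom in the second report.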

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
