Filter in foreach seems to drop records resulting in decreased count of records
-------------------------------------------------------------------------------

                 Key: PIG-739
                 URL: https://issues.apache.org/jira/browse/PIG-739
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.3.0
            Reporter: Viraj Bhat
             Fix For: 0.3.0


I have a Pig script in which I count the number of distinct records resulting 
from the filter, this statement is embedded in a foreach. The number of records 
I get with alias  TESTDATA_AGG_2 is 1.

{code}
TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, 
testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);

TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and 
timestamp lt '1230804000000' and value != 0);

TESTDATA_GROUP = group TESTDATA_FILTERED by testid;

TESTDATA_AGG = foreach TESTDATA_GROUP {
                        A = filter TESTDATA_FILTERED by (userid eq sessionid);
                        C = distinct A.userid;
                        generate group as testid, COUNT(TESTDATA_FILTERED) as 
counttestdata, COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as 
total_flags;
                }

TESTDATA_AGG_1 = group TESTDATA_AGG ALL;

-- count records generated through nested foreach which contains distinct
TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);

--explain TESTDATA_AGG_2;
dump TESTDATA_AGG_2;
--RESULT (1L)
{code}

But when I do the counting of records without the filter and distinct in the 
foreach I get a different value (20L)

{code}

TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, 
testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);

TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and 
timestamp lt '1230804000000' and value != 0);

TESTDATA_GROUP = group TESTDATA_FILTERED by testid;

-- count records generated through simple foreach
TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, 
COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as 
total_flags;

TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
dump TESTDATA_AGG2_2;
--RESULT (20L)
{code}

Attaching testdata

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to