[ https://issues.apache.org/jira/browse/PIG-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-739:
---------------------------

    Attachment: filter_distinctbug.pig
                testdata

Testdata and Pig script

> Filter in foreach seems to drop records resulting in decreased count of records
> -------------------------------------------------------------------------------
>
>                 Key: PIG-739
>                 URL: https://issues.apache.org/jira/browse/PIG-739
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.3.0
>            Reporter: Viraj Bhat
>             Fix For: 0.3.0
>
>         Attachments: filter_distinctbug.pig, testdata
>
>
> I have a Pig script in which I count the number of distinct records resulting
> from a filter; this statement is embedded in a foreach. The number of
> records I get with alias TESTDATA_AGG_2 is 1.
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid:chararray, sessionid:chararray, value:long, flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> TESTDATA_AGG = foreach TESTDATA_GROUP {
>     A = filter TESTDATA_FILTERED by (userid eq sessionid);
>     C = distinct A.userid;
>     generate group as testid, COUNT(TESTDATA_FILTERED) as counttestdata, COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as total_flags;
> }
> TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
> -- count records generated through nested foreach which contains distinct
> TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
> --explain TESTDATA_AGG_2;
> dump TESTDATA_AGG_2;
> --RESULT (1L)
> {code}
> But when I count the records without the filter and distinct in the
> foreach, I get a different value (20L).
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid:chararray, sessionid:chararray, value:long, flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> -- count records generated through simple foreach
> TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as total_flags;
> TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
> TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
> dump TESTDATA_AGG2_2;
> --RESULT (20L)
> {code}
> Attaching testdata.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.