Great question. There seems to be some confusion about how DISTINCT operates. I remembered (and thankfully found) this message<http://mail-archives.apache.org/mod_mbox/pig-user/201309.mbox/%3CCAE7pYjar3hX4Kp%2B5SQz3sr%3DvjxfQDVq_6Yi4vh9KgfOj3dzTGw%40mail.gmail.com%3E> that explains the behavior.
As per the other post, it looks like what you've documented is expected behavior. On Mon, Oct 28, 2013 at 4:15 PM, Min Zhou <coderp...@gmail.com> wrote: > Hi all, > > Below script is how we count distinct on columns jid and mid > > sjv = LOAD '/path/of/the/data' USING AvroStorage(); > jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time; > groupv = GROUP jv ALL; > countv = FOREACH groupv { > unique = DISTINCT jv.jid_mid; > GENERATE COUNT(unique); > }; > dump countv; > The result is 2302351. > > If I use code below, got another result > sjv = LOAD '/path/of/the/data' USING AvroStorage(); > groupv = GROUP sjv ALL; > countv = FOREACH groupv { > jid_mid = sjv.(jid, mid); > unique = DISTINCT jid_mid; > GENERATE COUNT(unique); > }; > dump countv; > The result is 2290003. > > If I concat the two columns with a delimiter never exists in jid and mid, I > got another result which I think is the correct answer of this aggregation. > sjv = LOAD '/path/of/the/data' USING AvroStorage(); > jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time; > groupv = GROUP jv ALL; > countv = FOREACH groupv { > unique = DISTINCT jv.jid_mid; > GENERATE COUNT(unique); > }; > dump countv; > The result is 2386385. > > I did a test with below script > sjv = LOAD '/path/of/the/data' USING AvroStorage(); > groupv = GROUP jv BY (jid, mid); > unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time); > store unique ....; > The hadoop counters showed that there are 2386385 records written into > HDFS. The number is as same as the 3rd pig script I list above. > > Can anyone explain the difference among those three? Whey they lead to > different results? > > Regards, > Min > -- > My research interests are distributed systems, parallel computing and > bytecode based virtual machine. > > My profile: > http://www.linkedin.com/in/coderplay > My blog: > http://coderplay.javaeye.com >