Hi all, Below script is how we count distinct on columns jid and mid
sjv = LOAD '/path/of/the/data' USING AvroStorage(); jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time; groupv = GROUP jv ALL; countv = FOREACH groupv { unique = DISTINCT jv.jid_mid; GENERATE COUNT(unique); }; dump countv; The result is 2302351. If I use code below, got another result sjv = LOAD '/path/of/the/data' USING AvroStorage(); groupv = GROUP sjv ALL; countv = FOREACH groupv { jid_mid = sjv.(jid, mid); unique = DISTINCT jid_mid; GENERATE COUNT(unique); }; dump countv; The result is 2290003. If I concat the two columns with a delimiter never exists in jid and mid, I got another result which I think is the correct answer of this aggregation. sjv = LOAD '/path/of/the/data' USING AvroStorage(); jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time; groupv = GROUP jv ALL; countv = FOREACH groupv { unique = DISTINCT jv.jid_mid; GENERATE COUNT(unique); }; dump countv; The result is 2386385. I did a test with below script sjv = LOAD '/path/of/the/data' USING AvroStorage(); groupv = GROUP jv BY (jid, mid); unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time); store unique ....; The hadoop counters showed that there are 2386385 records written into HDFS. The number is as same as the 3rd pig script I list above. Can anyone explain the difference among those three? Whey they lead to different results? Regards, Min -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com