Nested distincts are dangerous. They are not executed in a distributed fashion; the whole bag for each key has to be loaded into memory. That is what is killing the job, not the order/limit.
The alternative is to do two groups: first group by (citeddocid, citingdocid) to get the distinct pairs, and then group by citeddocid to get the count.

2012/1/5 <[email protected]>

> I have a small pig script that outputs the top 500 of a simple computed
> relation. It works fine on a small data set but fails on a larger (45 GB)
> data set. I don’t see errors in the hadoop logs (but I may be looking in
> the wrong places). On the large data set the pig log shows
>
> Input(s):
> Successfully read 1222894620 records (46581665598 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 1 records (3 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 1
> Total bytes written : 3
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 4640
> Total records proactively spilled: 605383326
>
> On the small data set the pig log shows
>
> Input(s):
> Successfully read 188865 records (6749318 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 500
> Total bytes written : 5031
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> The script is
>
> cr = load 'data' as
>     (
>     citeddocid : int,
>     citingdocid : int,
>     );
> CitedItemsGrpByDocId = group cr by citeddocid;
>
> DedupTCPerDocId =
>     foreach CitedItemsGrpByDocId {
>         CitingDocids = cr.citingdocid;
>         UniqCitingDocids = distinct CitingDocids;
>         generate group, COUNT(UniqCitingDocids) as tc;
>     };
>
> DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
> DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
> store DedupTCPerDocIdSorted500 [...]
>
>
> I assume I am just doing something grossly inefficiently. Can some one
> suggest a better way? I’m using Apache Pig version 0.8.1-cdh3u1
>
> Many thanks!
>
> Will
>
> William F Dowling
> Senior Technologist
>
> Thomson Reuters
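For reference, the two-group approach suggested at the top of this reply could look roughly like the sketch below. It is untested and I have not checked it against 0.8.1-cdh3u1. The field names come from the quoted script; the aliases (PairsGrp, UniqPairs, ByCited, TCPerDocId, TCSorted, Top500) and the 'output' path are placeholders made up for illustration, and the load schema assumes only the two columns shown.

```pig
-- Sketch only: aliases and the 'output' path are hypothetical placeholders.
cr = load 'data' as (citeddocid : int, citingdocid : int);

-- First group: one group per (citeddocid, citingdocid) pair, so duplicates
-- are collapsed by the shuffle rather than by a distinct inside a per-key bag.
PairsGrp  = group cr by (citeddocid, citingdocid);
UniqPairs = foreach PairsGrp generate flatten(group) as (citeddocid, citingdocid);

-- Second group: count the distinct citing documents per cited document.
ByCited    = group UniqPairs by citeddocid;
TCPerDocId = foreach ByCited generate group as citeddocid, COUNT(UniqPairs) as tc;

-- Same order/limit as in the original script.
TCSorted = order TCPerDocId by tc desc;
Top500   = limit TCSorted 500;
store Top500 into 'output';
```

Since COUNT here runs over a plain grouped bag, the combiner should be able to help, and no single reducer has to hold and deduplicate every citing id for a heavily cited document. One caveat: rows with a null citingdocid may be counted differently than in the original nested version, so it is worth comparing results on the small data set first.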
