Turning off the combiner is not a great thing to do, in general. It will destroy the performance of Algebraic functions like COUNT, which rely on the combiner for map-side partial aggregation.
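As a minimal sketch of what is at stake (schema and input path are borrowed from the script quoted below; relation names are placeholders): with the combiner enabled, an Algebraic aggregate such as COUNT is partially computed on the map side, so each mapper ships one partial count per key to the reducers instead of every raw record.

  cr  = LOAD 'data' AS (citeddocid:int, citingdocid:int);
  grp = GROUP cr BY citeddocid;

  -- COUNT is Algebraic: with the combiner on, each mapper emits one partial
  -- count per citeddocid. Note this plain COUNT does not de-duplicate citing
  -- docs the way the script quoted below does.
  tc  = FOREACH grp GENERATE group AS citeddocid, COUNT(cr) AS tc;

  -- SET pig.exec.nocombiner true;  -- would disable that map-side step for the whole script
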
Glad to help!

2012/1/6 <[email protected]>

> Thanks Jonathan and Prashant. The immediate cause of the problem I had
> (failing without erroring out) was slightly different formatting between
> the small and large input sets. Duh.
>
> When I fixed that, I did indeed get OOM due to the nested distinct. I
> tried the workaround you suggested, Jonathan, using two groups, and it
> worked great!
>
> In a separate run I also tried
> SET pig.exec.nocombiner true;
> and found that worked also, and the runtime was the same as using the
> two-group circumlocution.
>
> Thanks again for your help.
>
> Will
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -----Original Message-----
> From: Jonathan Coveney [mailto:[email protected]]
> Sent: Thursday, January 05, 2012 5:51 PM
> To: [email protected]
> Subject: Re: ORDER ... LIMIT failing on large data
>
> Nested distincts are dangerous. They are not done in a distributed
> fashion; they have to be loaded into memory. So that is what is killing
> it, not the order/limit.
>
> The alternative is to do two groups: first group by
> (citeddocid, citingdocid) to get the distinct, and then by citeddocid to
> get the count.
>
> 2012/1/5 <[email protected]>
>
> > I have a small pig script that outputs the top 500 of a simple computed
> > relation. It works fine on a small data set but fails on a larger
> > (45 GB) data set. I don’t see errors in the hadoop logs (but I may be
> > looking in the wrong places). On the large data set the pig log shows
> >
> > Input(s):
> > Successfully read 1222894620 records (46581665598 bytes) from: "[...]"
> >
> > Output(s):
> > Successfully stored 1 records (3 bytes) in: "hdfs://[...]"
> >
> > Counters:
> > Total records written : 1
> > Total bytes written : 3
> > Spillable Memory Manager spill count : 0
> > Total bags proactively spilled: 4640
> > Total records proactively spilled: 605383326
> >
> > On the small data set the pig log shows
> >
> > Input(s):
> > Successfully read 188865 records (6749318 bytes) from: "[...]"
> >
> > Output(s):
> > Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"
> >
> > Counters:
> > Total records written : 500
> > Total bytes written : 5031
> > Spillable Memory Manager spill count : 0
> > Total bags proactively spilled: 0
> > Total records proactively spilled: 0
> >
> > The script is
> >
> > cr = load 'data' as
> > (
> >   citeddocid : int,
> >   citingdocid : int
> > );
> > CitedItemsGrpByDocId = group cr by citeddocid;
> >
> > DedupTCPerDocId =
> >   foreach CitedItemsGrpByDocId {
> >     CitingDocids = cr.citingdocid;
> >     UniqCitingDocids = distinct CitingDocids;
> >     generate group, COUNT(UniqCitingDocids) as tc;
> >   };
> >
> > DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
> > DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
> > store DedupTCPerDocIdSorted500 [...]
> >
> > I assume I am just doing something grossly inefficient. Can someone
> > suggest a better way? I’m using Apache Pig version 0.8.1-cdh3u1.
> >
> > Many thanks!
> >
> > Will
> >
> > William F Dowling
> > Senior Technologist
> > Thomson Reuters
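
For reference, a rough sketch of the two-group rewrite suggested above (which Will reports worked). Relation names here are made up, and the output path is a placeholder:

  cr = LOAD 'data' AS (citeddocid:int, citingdocid:int);

  -- Step 1: group by the full pair, so each group key is one distinct
  -- (citeddocid, citingdocid) combination; this replaces the nested DISTINCT.
  pairs     = GROUP cr BY (citeddocid, citingdocid);
  uniqPairs = FOREACH pairs GENERATE group.citeddocid AS citeddocid;

  -- Step 2: group again by citeddocid and count the now-unique citing docs.
  -- COUNT here is a plain Algebraic aggregate, so the combiner still applies.
  perCited = GROUP uniqPairs BY citeddocid;
  dedupTC  = FOREACH perCited GENERATE group AS citeddocid, COUNT(uniqPairs) AS tc;

  sorted = ORDER dedupTC BY tc DESC;
  top500 = LIMIT sorted 500;
  STORE top500 INTO 'output';  -- placeholder path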
