William, can you please provide some more information on how many maps/reducers you are using and the memory you have allocated to each task (mapred.child.java.opts)?
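For reference, if the default child heap is the issue, that property can usually be set per-job from the Pig script itself with the SET command (supported in 0.8). A minimal sketch; the 1 GB value here is just a placeholder to experiment with, not a recommendation:

set mapred.child.java.opts '-Xmx1024m';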
-Prashant

On Thu, Jan 5, 2012 at 2:16 PM, <[email protected]> wrote:

> I have a small pig script that outputs the top 500 of a simple computed
> relation. It works fine on a small data set but fails on a larger (45 GB)
> data set. I don't see errors in the hadoop logs (but I may be looking in
> the wrong places). On the large data set the pig log shows
>
> Input(s):
> Successfully read 1222894620 records (46581665598 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 1 records (3 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 1
> Total bytes written : 3
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 4640
> Total records proactively spilled: 605383326
>
> On the small data set the pig log shows
>
> Input(s):
> Successfully read 188865 records (6749318 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 500
> Total bytes written : 5031
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> The script is
>
> cr = load 'data' as
>     (
>     citeddocid : int,
>     citingdocid : int
>     );
>
> CitedItemsGrpByDocId = group cr by citeddocid;
>
> DedupTCPerDocId =
>     foreach CitedItemsGrpByDocId {
>         CitingDocids = cr.citingdocid;
>         UniqCitingDocids = distinct CitingDocids;
>         generate group, COUNT(UniqCitingDocids) as tc;
>     };
>
> DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
> DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
> store DedupTCPerDocIdSorted500 [...]
>
> I assume I am just doing something grossly inefficient. Can someone
> suggest a better way? I'm using Apache Pig version 0.8.1-cdh3u1
>
> Many thanks!
>
> Will
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
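One thing that may help regardless of the memory settings: the "Total bags proactively spilled: 4640" counter suggests the nested bags built inside the foreach do not fit in memory, and a nested DISTINCT forces Pig to materialize the whole bag of citing docids for each group. A common rewrite is to de-duplicate the (citeddocid, citingdocid) pairs with a top-level DISTINCT before grouping, so the nested block disappears and COUNT (which is algebraic) can use the combiner. This is only a sketch against the schema in your script; the output path is a placeholder:

cr = load 'data' as (citeddocid : int, citingdocid : int);

-- De-duplicate the pairs up front instead of inside the foreach;
-- DISTINCT runs as its own MapReduce job with combiners, so no
-- large per-group bags have to be held in memory.
uniq = distinct cr;

grpd = group uniq by citeddocid;

-- COUNT over the already-unique bag is algebraic, so the combiner
-- applies here as well.
tcs = foreach grpd generate group, COUNT(uniq) as tc;

sorted = order tcs by tc desc;
top500 = limit sorted 500;
store top500 into 'output';  -- placeholder path

The trade-off is one extra MapReduce job for the DISTINCT, but each stage then streams through the combiner instead of spilling bags, which usually wins at the 45 GB scale you describe.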
