I have seen this happen when there is a very large number of distinct values for a set of group keys. When the combiner gets used, the input records for the reduce task already contain partial distinct bags, and these can grow into very large records that cause MapReduce to run out of memory trying to load them.
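One way to avoid building those per-group bags is to pull the DISTINCT out of the nested foreach and do it as a top-level operation on (citeddocid, citingdocid) pairs, so the de-duplication runs as its own MapReduce job instead of in reduce-side memory. Roughly like this, using the aliases from your script (this is only a sketch, the intermediate alias names are just illustrative, and it may not match the JIRA comment referenced below word for word):

-- keep only the two fields needed, then de-duplicate the
-- (citeddocid, citingdocid) pairs as a top-level DISTINCT
CitedItemPairs  = foreach CitedItems generate citeddocid, citingdocid;
DistinctPairs   = distinct CitedItemPairs;

-- each group now holds only unique citing doc ids, so a plain COUNT
-- gives the de-duplicated citation count per cited doc
GrpByCitedDocId = group DistinctPairs by citeddocid;
DedupTCPerDocId = foreach GrpByCitedDocId generate group, COUNT(DistinctPairs) as tc;

Because COUNT is algebraic, the final group-and-count can still use the combiner safely; it accumulates partial counts rather than partial distinct bags.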
You can modify the query the way it is mentioned in comment #1 in
https://issues.apache.org/jira/browse/PIG-1846

Or you can add the following to your script to disable the combiner:

set pig.exec.nocombiner true;

Thanks,
Thejas


On 6/10/11 11:15 AM, "[email protected]" <[email protected]> wrote:

> I have a pig script that is working well for small test data sets but fails on
> a run over realistic-sized data. Logs show
>
> INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201106061024_0331 has failed!
> S
> job_201106061024_0331 CitedItemsGrpByDocId,DedupTCPerDocId GROUP_BY,COMBINER Message: Job failed!
> S
> attempt_201106061024_0331_m_000198_0 [S] Error: java.lang.OutOfMemoryError: Java heap space
>
> and similarly for all attempts at a few of the other (many) map tasks for
> this job.
>
> I believe this job corresponds to these lines in my pig script:
>
> CitedItemsGrpByDocId = group CitedItems by citeddocid;
> DedupTCPerDocId =
>     foreach CitedItemsGrpByDocId {
>         CitingDocids = CitedItems.citingdocid;
>         UniqCitingDocids = distinct CitingDocids;
>         generate group, COUNT(UniqCitingDocids) as tc;
>     };
>
> I tried increasing mapred.child.java.opts but the job failed in a setup stage
> with
>
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
>
> Are there job configurations/parameters for Hadoop or Pig I can set to get
> around this? Is there a Pig Latin circumlocution, or better way to express
> what I want, that is not as memory-hungry?
>
> Thanks in advance,
>
> Will
>
> William F Dowling
> Sr Technical Specialist, Software Engineering
>
> --
