Turning off the combiner is not a great thing to do, in general. It will
destroy the performance of Algebraic functions like COUNT, which rely on the
combiner for map-side partial aggregation.
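
For instance, a plain grouped COUNT like the sketch below (relation names are
placeholders; field names are borrowed from the script quoted further down) is
Algebraic, so with the combiner enabled partial counts are computed on the map
side and only small partial results reach the reducers:

by_doc = GROUP cr BY citeddocid;
-- COUNT is Algebraic: the combiner computes partial counts map-side
-- and the reducer merges them, so very little data is shuffled
counts = FOREACH by_doc GENERATE group AS citeddocid, COUNT(cr) AS tc;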

Glad to help!

2012/1/6 <[email protected]>

> Thanks Jonathan and Prashant. The immediate cause of the problem I had
> (failing without erroring out) was slightly different formatting between
> the small and large input sets. Duh.
>
> When I fixed that, I did indeed get OOM due to the nested distinct. I
> tried the workaround you suggested Jonathan using two groups, and it worked
> great!
>
> In a separate run I also tried
>  SET pig.exec.nocombiner true;
> and found that worked as well; the runtime was the same as with the
> two-group circumlocution.
>
> Thanks again for your help.
>
> Will
>
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -----Original Message-----
> From: Jonathan Coveney [mailto:[email protected]]
> Sent: Thursday, January 05, 2012 5:51 PM
> To: [email protected]
> Subject: Re: ORDER ... LIMIT failing on large data
>
> Nested distincts are dangerous: they are not executed in a distributed
> fashion, so the whole bag has to be loaded into memory. That is what is
> killing the job, not the ORDER ... LIMIT.
>
> The alternative is to do two groups: first group by (citeddocid,
> citingdocid) to get the distinct pairs, and then group by citeddocid to get
> the count.
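>
> A rough sketch of that rewrite, reusing the field names from your script
> (relation names here are just placeholders):
>
> -- first group: reduce to one record per distinct (citeddocid, citingdocid) pair
> pairs          = GROUP cr BY (citeddocid, citingdocid);
> distinct_pairs = FOREACH pairs GENERATE group.citeddocid  AS citeddocid,
>                                         group.citingdocid AS citingdocid;
>
> -- second group: count the distinct citing docs per cited doc;
> -- COUNT here is Algebraic, so the combiner can do partial aggregation
> by_cited        = GROUP distinct_pairs BY citeddocid;
> DedupTCPerDocId = FOREACH by_cited GENERATE group AS citeddocid,
>                                             COUNT(distinct_pairs) AS tc;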
>
> 2012/1/5 <[email protected]>
>
> > I have a small pig script that outputs the top 500 of a simple computed
> > relation. It works fine on a small data set but fails on a larger (45 GB)
> > data set. I don’t see errors in the hadoop logs (but I may be looking in
> > the wrong places). On the large data set the pig log shows
> >
> > Input(s):
> > Successfully read 1222894620 records (46581665598 bytes) from: "[...]"
> >
> > Output(s):
> > Successfully stored 1 records (3 bytes) in: "hdfs://[...]"
> >
> > Counters:
> > Total records written : 1
> > Total bytes written : 3
> > Spillable Memory Manager spill count : 0
> > Total bags proactively spilled: 4640
> > Total records proactively spilled: 605383326
> >
> > On the small data set the pig log shows
> >
> > Input(s):
> > Successfully read 188865 records (6749318 bytes) from: "[...]"
> >
> > Output(s):
> > Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"
> >
> > Counters:
> > Total records written : 500
> > Total bytes written : 5031
> > Spillable Memory Manager spill count : 0
> > Total bags proactively spilled: 0
> > Total records proactively spilled: 0
> >
> > The script is
> >
> > cr = load 'data' as
> >     (
> >       citeddocid  : int,
> >       citingdocid : int
> >     );
> > CitedItemsGrpByDocId = group cr by citeddocid;
> >
> > DedupTCPerDocId =
> >     foreach CitedItemsGrpByDocId {
> >          CitingDocids =  cr.citingdocid;
> >          UniqCitingDocids = distinct CitingDocids;
> >          generate group, COUNT(UniqCitingDocids) as tc;
> >     };
> >
> > DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
> > DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
> > store DedupTCPerDocIdSorted500 [...]
> >
> >
> > I assume I am just doing something grossly inefficient. Can someone
> > suggest a better way? I'm using Apache Pig version 0.8.1-cdh3u1.
> >
> > Many thanks!
> >
> > Will
> >
> > William F Dowling
> > Senior Technologist
> >
> > Thomson Reuters
> >
> >
> >
> >
>
