William, can you please provide some more information on how many maps/reducers you are using and the memory you have allocated to each task (mapred.child.java.opts)?
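For reference, if the default child heap is the issue, that property can usually be set per-job from the Pig script itself with the SET command (supported in 0.8). A minimal sketch; the 1 GB value here is just a placeholder to experiment with, not a recommendation:

set mapred.child.java.opts '-Xmx1024m';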
-Prashant

On Thu, Jan 5, 2012 at 2:16 PM, <[email protected]> wrote:

> I have a small pig script that outputs the top 500 of a simple computed
> relation. It works fine on a small data set but fails on a larger (45 GB)
> data set. I don't see errors in the hadoop logs (but I may be looking in
> the wrong places). On the large data set the pig log shows
>
> Input(s):
> Successfully read 1222894620 records (46581665598 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 1 records (3 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 1
> Total bytes written : 3
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 4640
> Total records proactively spilled: 605383326
>
> On the small data set the pig log shows
>
> Input(s):
> Successfully read 188865 records (6749318 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 500
> Total bytes written : 5031
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> The script is
>
> cr = load 'data' as
>     (
>     citeddocid : int,
>     citingdocid : int
>     );
>
> CitedItemsGrpByDocId = group cr by citeddocid;
>
> DedupTCPerDocId =
>     foreach CitedItemsGrpByDocId {
>         CitingDocids = cr.citingdocid;
>         UniqCitingDocids = distinct CitingDocids;
>         generate group, COUNT(UniqCitingDocids) as tc;
>     };
>
> DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
> DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
> store DedupTCPerDocIdSorted500 [...]
>
> I assume I am just doing something grossly inefficient. Can someone
> suggest a better way? I'm using Apache Pig version 0.8.1-cdh3u1
>
> Many thanks!
>
> Will
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
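One thing that may help regardless of the memory settings: the "Total bags proactively spilled: 4640" counter suggests the nested bags built inside the foreach do not fit in memory, and a nested DISTINCT forces Pig to materialize the whole bag of citing docids for each group. A common rewrite is to de-duplicate the (citeddocid, citingdocid) pairs with a top-level DISTINCT before grouping, so the nested block disappears and COUNT (which is algebraic) can use the combiner. This is only a sketch against the schema in your script; the output path is a placeholder:

cr = load 'data' as (citeddocid : int, citingdocid : int);

-- De-duplicate the pairs up front instead of inside the foreach;
-- DISTINCT runs as its own MapReduce job with combiners, so no
-- large per-group bags have to be held in memory.
uniq = distinct cr;

grpd = group uniq by citeddocid;

-- COUNT over the already-unique bag is algebraic, so the combiner
-- applies here as well.
tcs = foreach grpd generate group, COUNT(uniq) as tc;

sorted = order tcs by tc desc;
top500 = limit sorted 500;
store top500 into 'output';  -- placeholder path

The trade-off is one extra MapReduce job for the DISTINCT, but each stage then streams through the combiner instead of spilling bags, which usually wins at the 45 GB scale you describe.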
