Re: Replace join with custom implementation

Pradeep Gollakota Fri, 02 Aug 2013 06:14:43 -0700

join BIG by key, SMALL by key using 'replicated';
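
A fuller sketch of the same statement for a four-field key, as in the question below. The input paths and field names (k1..k4) are placeholders, not taken from the original scripts. Pig loads the last relation listed into memory for a replicated join, so SMALL has to come last.

```pig
-- Hypothetical paths and schemas: only the four integer join keys are shown.
BIG   = LOAD 'big_input'   AS (k1:int, k2:int, k3:int, k4:int);
SMALL = LOAD 'small_input' AS (k1:int, k2:int, k3:int, k4:int);

-- Fragment-replicate join: runs map-side, with SMALL held in each mapper's memory.
JOINED = JOIN BIG BY (k1, k2, k3, k4), SMALL BY (k1, k2, k3, k4) USING 'replicated';
```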


On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <[email protected]> wrote:

> Hi. I've hit a problem with replicated join in Pig 0.11.
> I have two relations:
> BIG (3-6GB) and SMALL (100MB)
> I join them on four integer fields.
> It takes up to 30 minutes to join them.
>
> Join runs on 18 reducers: -Xmx=3072mb for Java, 128 GB in total
> 32 cores on each TaskTracker.
>
> So our hardware is really powerful.
>
> I ran part of the join locally and found a terrible situation:
> 50% of the heap is:
> Integers,
> arrays of these Integers,
> and ArrayLists for the arrays of Integers.
>
> The GC overhead limit is exceeded. The same happened on the cluster. I raised
> Xms and Xmx on the cluster and the problem is gone.
>
> Anyway, joining 6GB/18 and 100MB for 30 minutes is far too much.
> I would like to reimplement the replicated join.
> How can I do it?
>
