join BIG by key, SMALL by key using 'replicated';
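
To spell that out for a join on four integer fields, here is a minimal sketch. The paths, aliases, and field names (k1..k4, payload, info) are made up for illustration, since I don't know your actual schema. The larger relation goes first; the relation listed after it is the one loaded into memory on each map task.

```pig
-- Hypothetical paths and schemas; substitute your own.
big   = LOAD '/data/big'   AS (k1:int, k2:int, k3:int, k4:int, payload:chararray);
small = LOAD '/data/small' AS (k1:int, k2:int, k3:int, k4:int, info:chararray);

-- Fragment-replicate join: 'small' is held in memory on every map task,
-- 'big' is streamed past it, so no reduce phase is needed for the join.
joined = JOIN big BY (k1, k2, k3, k4), small BY (k1, k2, k3, k4) USING 'replicated';

STORE joined INTO '/data/joined';
```

If the job still shows a reduce phase for the join itself, the 'replicated' hint is probably not being applied to this statement.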
On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <[email protected]> wrote:

> Hi. I've run into a problem with replicated join in Pig 0.11.
> I have two relations:
> BIG (3-6 GB) and SMALL (100 MB).
> I join them on four integer fields.
> It takes up to 30 minutes to join them.
>
> The join runs on 18 reducers: -Xmx=3072mb for Java, 128 GB in total,
> 32 cores on each TaskTracker.
>
> So our hardware is really powerful.
>
> I ran part of the join locally and hit a terrible situation:
> 50% of the heap is
> Integers,
> arrays of these Integers,
> and ArrayLists holding the arrays of Integers.
>
> "GC overhead limit exceeded" happens. The same happened on the cluster.
> I raised Xms, Xmx on the cluster and the problem is gone.
>
> Anyway, joining 6 GB/18 and 100 MB for 30 minutes is too much.
> I would like to reimplement replicated join.
> How can I do it?
>
