Oh... sorry... I missed the part where you said that you want to reimplement the replicated join algorithm.
On Fri, Aug 2, 2013 at 9:13 AM, Pradeep Gollakota <[email protected]> wrote:

> join BIG by key, SMALL by key using 'replicated';
>
>
> On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <[email protected]> wrote:
>
>> Hi. I've hit a problem with replicated join in Pig 0.11.
>> I have two relations:
>> BIG (3-6 GB) and SMALL (100 MB).
>> I join them on four integer fields.
>> It takes up to 30 minutes to join them.
>>
>> The join runs on 18 reducers: -Xmx=3072m for Java, 128 GB of RAM in total and
>> 32 cores on each TaskTracker.
>>
>> So our hardware is really powerful.
>>
>> I ran part of the join locally and saw a terrible situation:
>> 50% of the heap was
>> Integers,
>> arrays of those Integers,
>> and ArrayLists holding those arrays of Integers.
>>
>> The GC overhead limit was hit. The same happened on the cluster. I raised Xms and
>> Xmx on the cluster and the problem is gone.
>>
>> Anyway, joining 6 GB / 18 reducers against 100 MB for 30 minutes is far too long.
>> I would like to reimplement the replicated join.
>> How can I do it?
>>
>
>
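
For reference, a fragment-replicate join over four integer keys could look like the sketch below. The relation names, field names, and paths (big, small, k1..k4, the LOAD/STORE locations) are placeholders rather than anything from the original script; the key point is that the small relation must be listed last in the JOIN statement, since that is the one Pig replicates into each mapper's memory.

    -- Hypothetical schemas and paths; adjust to the real data.
    big   = LOAD 'big_input'   AS (k1:int, k2:int, k3:int, k4:int, payload:chararray);
    small = LOAD 'small_input' AS (k1:int, k2:int, k3:int, k4:int, dim:chararray);

    -- Fragment-replicate join: the SMALL relation comes last and must fit in memory.
    joined = JOIN big BY (k1, k2, k3, k4), small BY (k1, k2, k3, k4) USING 'replicated';

    STORE joined INTO 'joined_output';

If tasks still hit the GC overhead limit, the per-task heap can also be raised from inside the script, e.g. SET mapred.child.java.opts '-Xmx4096m'; (that property name assumes a Hadoop 1.x-era cluster, as was typical with Pig 0.11).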
