Oh... sorry... I missed the part where you said that you want to
reimplement the replicated join algorithm.


On Fri, Aug 2, 2013 at 9:13 AM, Pradeep Gollakota <[email protected]>wrote:

> join BIG by key, SMALL by key using 'replicated';
>
>
> On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak 
> <[email protected]>wrote:
>
>> Hi. I've run into a problem with replicated join in Pig 0.11.
>> I have two relations:
>> BIG (3-6 GB) and SMALL (100 MB)
>> I join them on four integer fields.
>> It takes up to 30 minutes to join them.
>>
>> The join runs on 18 reducers with -Xmx of 3072 MB for Java; each
>> TaskTracker has 128 GB of RAM in total and 32 cores.
>>
>> So our hardware is really powerful.
>>
>> I ran a part of the join locally and hit a terrible situation:
>> 50% of the heap was:
>> Integers,
>> arrays of these Integers,
>> and ArrayLists holding those arrays.
>>
>> A GC overhead limit error occurs. The same happened on the cluster. I raised
>> Xms and Xmx on the cluster and the problem is gone.
>>
>> Anyway, taking 30 minutes to join 6 GB (split across 18 reducers) with
>> 100 MB is far too much.
>> I would like to reimplement the replicated join.
>> How can I do it?
>>
>
>
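For the four-key join described in the thread, a minimal Pig Latin sketch of
the replicated join looks like the following (the relation schemas and field
names here are assumptions, not from the original messages):

-- Hypothetical schemas, for illustration only.
big   = LOAD 'big_input'   AS (k1:int, k2:int, k3:int, k4:int, payload:chararray);
small = LOAD 'small_input' AS (k1:int, k2:int, k3:int, k4:int, lookup:chararray);

-- The replicated (small) relation must be listed last; Pig loads it into
-- memory on each map task, which is consistent with the Integer/ArrayList
-- heap usage reported above.
joined = JOIN big BY (k1, k2, k3, k4), small BY (k1, k2, k3, k4) USING 'replicated';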
