Replace join with custom implementation

Serega Sheypak Fri, 02 Aug 2013 02:30:31 -0700

Hi. I've run into a problem with replicated join in Pig 0.11.
I have two relations:
BIG (3-6 GB) and SMALL (100 MB).
I join them on four integer fields.
It takes up to 30 minutes to join them.


Join runs on 18 reducers: -Xmx=3072mb for Java, 128 GB in total
32 cores on each TaskTracker.

So our hardware is really powerful.

I ran part of the join locally and found a terrible situation:
50% of the heap is:
Integers,
arrays of these Integers,
and ArrayLists holding the arrays of Integers.

The "GC overhead limit exceeded" error occurs. The same happened on the
cluster. I raised Xms and Xmx on the cluster and the problem went away.

Anyway, joining 6 GB/18 with 100 MB for 30 minutes is far too long.
I would like to reimplement replicated join.
How can I do it?
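
To make the question concrete, here is a minimal sketch of the kind of
in-memory hash join I have in mind. It is not Pig's implementation and
does not use any Pig API. The class name CompactReplicatedJoin, the
Key4 helper and the row layout (each row is an int[] whose first four
columns are the join fields) are all assumptions made up for
illustration. The idea is to keep the four key fields as primitive ints
in one small key object and to store matches as int[] row positions,
instead of Integers inside arrays inside ArrayLists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: hash join on four int keys, not Pig's own code. */
public class CompactReplicatedJoin {

    /** Composite key holding the four join fields as primitives (no boxing per field). */
    static final class Key4 {
        final int a, b, c, d;

        Key4(int a, int b, int c, int d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key4)) return false;
            Key4 k = (Key4) o;
            return a == k.a && b == k.b && c == k.c && d == k.d;
        }

        @Override
        public int hashCode() {
            int h = a;
            h = 31 * h + b;
            h = 31 * h + c;
            h = 31 * h + d;
            return h;
        }
    }

    /** Indexes the small relation: key -> positions of matching rows in `small`. */
    static Map<Key4, int[]> buildIndex(int[][] small) {
        Map<Key4, int[]> index = new HashMap<Key4, int[]>();
        for (int i = 0; i < small.length; i++) {
            int[] r = small[i];
            Key4 k = new Key4(r[0], r[1], r[2], r[3]);
            int[] old = index.get(k);
            if (old == null) {
                index.put(k, new int[] { i });
            } else {
                // Duplicate key: grow the primitive position array by one slot.
                int[] grown = Arrays.copyOf(old, old.length + 1);
                grown[old.length] = i;
                index.put(k, grown);
            }
        }
        return index;
    }

    /** Streams the big relation and emits big-row + small-row for every key match. */
    static List<int[]> join(int[][] big, int[][] small) {
        Map<Key4, int[]> index = buildIndex(small);
        List<int[]> out = new ArrayList<int[]>();
        for (int[] b : big) {
            int[] hits = index.get(new Key4(b[0], b[1], b[2], b[3]));
            if (hits == null) continue; // inner join: unmatched rows are dropped
            for (int pos : hits) {
                int[] s = small[pos];
                int[] joined = Arrays.copyOf(b, b.length + s.length);
                System.arraycopy(s, 0, joined, b.length, s.length);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows: four key columns followed by one payload column.
        int[][] small = { {1, 2, 3, 4, 100}, {5, 6, 7, 8, 200} };
        int[][] big = { {1, 2, 3, 4, 9}, {0, 0, 0, 0, 9}, {5, 6, 7, 8, 7} };
        for (int[] row : join(big, small)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

This still allocates one Key4 per probe, so it is only a starting
point. What I don't know is how to plug something like this into Pig in
place of the built-in replicated join.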
