Hi. I've run into a problem with replicated join in Pig 0.11. I have two relations: BIG (3-6 GB) and SMALL (100 MB). I join them on four integer fields, and the join takes up to 30 minutes.
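For illustration, a join of this shape could be written roughly as follows. This is only a sketch: the relation aliases, input paths, and field names (k1 to k4, payload, info) are placeholders assumed for the example, not the actual schema.

```pig
-- Placeholder schemas: four integer join keys plus one extra column each.
big   = LOAD 'big_input'   AS (k1:int, k2:int, k3:int, k4:int, payload:chararray);
small = LOAD 'small_input' AS (k1:int, k2:int, k3:int, k4:int, info:chararray);

-- In a replicated join, every relation listed after the first is loaded
-- into memory in each task, so the small relation must come last.
joined = JOIN big BY (k1, k2, k3, k4), small BY (k1, k2, k3, k4) USING 'replicated';

STORE joined INTO 'joined_output';
```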
The join runs on 18 reducers with -Xmx=3072mb for Java; 128 GB in total, 32 cores on each TaskTracker. So our hardware is really powerful. I ran part of the join locally and hit a terrible situation: 50% of the heap is Integers, arrays of these Integers, and ArrayLists holding the arrays of Integers. A "GC overhead limit exceeded" error occurs. The same happened on the cluster. I raised Xms and Xmx on the cluster and the problem went away. Still, joining 6 GB/18 with 100 MB for 30 minutes is far too long. I would like to reimplement replicated join. How can I do it?
