Use EXPLAIN to look at the details of the execution plan (it prints the logical, physical, and MapReduce plans); ILLUSTRATE is also useful, since it walks sample data through each step. At first glance it doesn't look like you'd need a second job to do the joins, but if you can post the EXPLAIN/ILLUSTRATE output, we can look into it.
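For example, assuming the final alias is joined_4 as in your script, something along these lines should produce the output (a rough sketch only; explain_out is a hypothetical output directory I made up):

```pig
-- Print the logical, physical, and MapReduce plans for the final alias.
EXPLAIN joined_4;

-- Same plans, written to a directory instead of the console.
-- 'explain_out' is a hypothetical path; pick any location you like.
EXPLAIN -out explain_out joined_4;

-- Run the pipeline on a small sample and show the data at each step.
ILLUSTRATE joined_4;
```

In the MapReduce plan section of the EXPLAIN output, each job should appear as its own MapReduce node, so counting those nodes and checking which one contains the replicated-join operators should show where the joins are being scheduled.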

On Mon, Nov 11, 2013 at 4:36 PM, Dexin Wang <wangde...@gmail.com> wrote:

> Hi,
>
> I'm running a job like this:
>
> raw_large = LOAD 'lots_of_files' AS (...);
> raw_filtered = FILTER raw_large BY ...;
> large_table = FOREACH raw_filtered GENERATE f1, f2, f3,....;
>
> joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2)
>     USING 'replicated';
> joined_2 = JOIN joined_1 BY (key3) LEFT, config_table_2 BY (key4)
>     USING 'replicated';
> joined_3 = JOIN joined_2 BY (key5) LEFT, config_table_3 BY (key6)
>     USING 'replicated';
> joined_4 = JOIN joined_3 BY (key7) LEFT, config_table_3 BY (key8)
>     USING 'replicated';
>
> Basically, I left join a large table with 4 relatively small tables using
> the replicated join.
>
> I see that the first job has 120 mapper tasks and no reducer, and this job
> seems to be doing the load and filtering. It is followed by another job
> with 26 mapper tasks that seems to be doing the joins.
>
> Shouldn't there be only one job, with the joins done in the mapper
> phase of the first job?
>
> The 4 config tables (files) have these sizes respectively:
>
> 3MB
> 220kB
> 2kB
> 100kB
>
> These are running on AWS EMR Pig 0.9.2 on xlarge instances, which have 15GB
> of memory.
>
> Thanks!