Thanks for the explanation, Marcos!
On Thu, Mar 28, 2013 at 3:08 AM, MARCOS MEDRADO RUBINELLI <[email protected]> wrote:

>> Hi, Mix:
>>
>> "second map reduce started executing before first one got completed"
>>
>> Interesting. Since you only LOAD evnt_dtl, without DUMPing or STOREing it,
>> Pig shouldn't do anything with it, especially not before the STORE command
>> completes.
>>
>> I have the script below and it works fine, so I think the root cause is
>> something else. Unless your data is very big?
>>
>> a = load 'words_and_numbers' as (f1:chararray, f2:chararray);
>> b = filter a by f1 is not null;
>> store (foreach (group b all) generate flatten($1)) into 'multipleload/tmp';
>> c = load 'multipleload/tmp/part-r-00000' as (f3:chararray, f4:chararray);
>> dump c;
>>
>> Johnny
>
> It's the multi-query execution optimization. Pig doesn't know it should
> wait for the STORE before the second LOAD, so it tries to run them in
> parallel. You have three options:
>
> 1. Name the relation you stored and use it instead of loading a new
> relation:
>
> Data = LOAD '/....' as (,,,, )
> NoNullData = FILTER Data by qe is not null;
> exp = foreach (group NoNullData all) generate flatten($1);
> STORE exp into 'exp/$inputDatePig';
> evnt_dtl = FOREACH exp GENERATE $0 as cust ...
>
> 2. Use the EXEC keyword to tell Pig to finish the commands up to that
> point before running the rest:
>
> Data = LOAD '/....' as (,,,, )
> NoNullData = FILTER Data by qe is not null;
> STORE (foreach (group NoNullData all) generate flatten($1)) into 'exp/$inputDatePig';
> EXEC;
> evnt_dtl = LOAD 'exp/$inputDatePig/part-r-00000' AS (cust,,,,,)
>
> 3. Disable multi-query execution:
>
> $ pig -no_multiquery x.pig
>
> - Marcos
