Mohit, Pig does not run (or even consider running) a separate MR job for every line in the pig script. If you do a join and then filter, the filter will happen in the same MR job as the join, since filtering records does not require any shuffling.
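
If you want to check this yourself, here is a minimal sketch using EXPLAIN, which prints the plans Pig builds for an alias. The input paths and field types are assumptions for illustration only; the relation and field names are taken from your script.

```pig
-- Hypothetical input paths and field types; adjust to your data.
A = LOAD 'input_a' AS (FILE_NAME:chararray, CREATED_DATE:chararray, FORM_ID:int, FORM_ID_ROOT:int);
B = LOAD 'input_b' AS (FILE_NAME:chararray, CREATED_DATE:chararray, FORM_ID:int, FORM_ID_ROOT:int);

F = JOIN A BY (FILE_NAME, CREATED_DATE, FORM_ID, FORM_ID_ROOT),
         B BY (FILE_NAME, CREATED_DATE, FORM_ID, FORM_ID_ROOT);

-- FORM_ID is qualified with A:: because both inputs carry a FORM_ID field.
G = FILTER F BY A::FORM_ID == 0;

-- Print the logical, physical, and MapReduce plans for G without running it.
EXPLAIN G;
```

The MapReduce plan section of the output is the part to look at: it lists the jobs Pig intends to run, so you can see which job the filter ends up in.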
Now, there is a problem in the code you have, and it comes from the fact that, unlike in SQL, in Pig Latin you have access to the original fields in the joined relation, so "FORM_ID" is ill-defined after the join. You want to specify either A::FORM_ID or B::FORM_ID.

Also note that it would be more efficient to put these filters prior to the join and avoid moving those records in the MR shuffle. Pig may be smart enough to automatically do this for you, but it's better to just write the more efficient code to begin with.

Lastly, Pig doesn't actually force you to write code in all caps :). It just looks like it might. You can use lower-case relation names, keywords, etc. UDF invocations do have to match their class names.

D

On Wed, Apr 11, 2012 at 3:39 PM, Mohit Anchlia <[email protected]> wrote:

> Is it possible to say something like
>
>
> F = JOIN A BY (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT), B BY
> (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT) AND FILTER A BY FORM_ID == 0;
>
> Also, how far does pig go in optimizing the job if I do specify the line
> above for instance as:
>
> F = JOIN A BY (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT), B BY
> (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT)
>
> G = FILTER F BY FORM_ID == 0;
>
> Would pig run only one reduce job or multiple in the case above?
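
For reference, the second form in the quoted question could be rewritten along the lines suggested in the reply: filters placed before the join, and lower-case keywords and names. This is only a sketch. The input paths, the field types, and the alias names a0 and b0 are assumptions, not part of the original script.

```pig
-- Hypothetical input paths and field types; adjust to your data.
a = load 'input_a' as (file_name:chararray, created_date:chararray, form_id:int, form_id_root:int);
b = load 'input_b' as (file_name:chararray, created_date:chararray, form_id:int, form_id_root:int);

-- Filter before the join so that non-matching records never reach the shuffle.
-- form_id is one of the join keys, so for this inner join filtering both
-- inputs gives the same result as filtering only one of them.
a0 = filter a by form_id == 0;
b0 = filter b by form_id == 0;

f = join a0 by (file_name, created_date, form_id, form_id_root),
         b0 by (file_name, created_date, form_id, form_id_root);

-- After the join, fields are addressed as a0::form_id, b0::form_id, and so on.
```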
