Re: Join,Filter on the same line and optimization

Dmitriy Ryaboy Thu, 12 Apr 2012 10:03:58 -0700

Mohit,
Pig does not run (or even consider running) a separate MR job for
every line in the Pig script.
If you do a join and then filter, the filter will happen in the same
MR job as the join, since filtering records does not require any
shuffling.
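
If you want to check this on your own script, Pig's EXPLAIN operator
prints the logical, physical and MapReduce plans for an alias. A
minimal sketch, assuming A and B are already loaded with the fields
from your message:

```pig
F = JOIN A BY (FILE_NAME, CREATED_DATE, FORM_ID, FORM_ID_ROOT),
         B BY (FILE_NAME, CREATED_DATE, FORM_ID, FORM_ID_ROOT);
-- After the join, FORM_ID has to be qualified with its source relation.
G = FILTER F BY A::FORM_ID == 0;
-- Print the plans for G; the MapReduce plan section lists the jobs Pig will run.
EXPLAIN G;
```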

Now, there is a problem in the code you have. Unlike in SQL, in Pig
Latin the joined relation keeps the original fields from both inputs,
so "FORM_ID" is ambiguous after the join. You want to specify either
A::FORM_ID or B::FORM_ID. Also note that it would be more efficient to
put these filters before the join, so that the filtered-out records
never move through the MR shuffle. Pig may be smart enough to do this
for you automatically, but it's better to just write the more
efficient code to begin with.
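
A rough sketch of the filter-first version, assuming the same A and B
as in your message (A0 and B0 are just names I made up for the
filtered relations). Since FORM_ID is one of the join keys, filtering
B as well should not change the result of an inner join:

```pig
-- Filter each input before the join so fewer records reach the shuffle.
A0 = FILTER A BY FORM_ID == 0;
B0 = FILTER B BY FORM_ID == 0;
F = JOIN A0 BY (FILE_NAME, CREATED_DATE, FORM_ID, FORM_ID_ROOT),
         B0 BY (FILE_NAME, CREATED_DATE, FORM_ID, FORM_ID_ROOT);
-- Fields in F are referenced as A0::FORM_ID, B0::FORM_ID, and so on.
```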

Lastly, Pig doesn't actually force you to write code in all caps :).
It just looks like it might. You can use lower-case relation names,
keywords, etc. UDF invocations do have to match their class names.

D

On Wed, Apr 11, 2012 at 3:39 PM, Mohit Anchlia <[email protected]> wrote:
> Is it possible to say something like
>
>
> F = JOIN A BY (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT), B BY
> (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT) AND FILTER A BY FORM_ID == 0;
>
> Also, how far does pig go in optimizing the job if I do specify the line
> above for instance as:
>
> F = JOIN A BY (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT), B BY
> (FILE_NAME,CREATED_DATE,FORM_ID,FORM_ID_ROOT)
>
> G = FILTER F BY FORM_ID == 0;
>
> Would pig run only one reduce job or multiple in the case above?
