Just to be explicit:

This:

x = FILTER something by num1 > 10 AND num2 < 12;

is equivalent to this:

x = FILTER something by num1 > 10;
x = FILTER x by num2 < 12;

All non-blocking operators are evaluated in a streaming fashion, so you
don't need to worry about combining them into a single operator.
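
As a rough analogy only (this is Python, not Pig, and it says nothing about how Pig actually plans or runs the job), lazy generators behave in a similar streaming way. The sample rows and the function names `combined` and `chained` below are made up for illustration. The sketch shows that one filter with both conditions and two filters applied back to back keep the same rows, with no intermediate list built between the two steps:

```python
# Hypothetical rows standing in for the relation `something`.
rows = [
    {"num1": 11, "num2": 5},
    {"num1": 9, "num2": 5},
    {"num1": 20, "num2": 12},
    {"num1": 15, "num2": 11},
]


def combined(rows):
    # One filter carrying both conditions (num1 > 10 AND num2 < 12).
    return (r for r in rows if r["num1"] > 10 and r["num2"] < 12)


def chained(rows):
    # Two filters back to back. Generators are lazy, so each row flows
    # through both predicates in turn; nothing is materialized in between.
    x = (r for r in rows if r["num1"] > 10)
    return (r for r in x if r["num2"] < 12)


combined_result = list(combined(rows))
chained_result = list(chained(rows))
```

Both lists end up holding the same rows in the same order, which is the sense in which the split version costs nothing extra.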

On Wed, Nov 2, 2011 at 10:56 AM, Ashutosh Chauhan <[email protected]> wrote:

> Hi Cameron,
>
> Your script looks alright. Each of your steps processes data in a different
> way. Instead of cramming them together in a single statement (possibly via
> some custom UDF), it makes sense to keep them as a series of steps, as you
> have done, for better readability and debuggability. Are you worried about
> performance? You need not be. As long as your operations don't introduce an
> unnecessary map-reduce boundary (which your script doesn't), you are good.
>
> Hope it helps,
> Ashutosh
>
> On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[email protected]>
> wrote:
>
> > Hey
> >
> > I am trying to extract performance metrics from some of my logs using Pig
> > and have come up with the following. I feel like I might be performing one
> > too many steps and was wondering if there is a way to reduce the number of
> > FILTER/FOREACH operations I need to run. Still trying to learn the proper
> > syntax.
> >
> > uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as body:CHARARRAY;
> > metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
> > metricLogData = FOREACH metricLogLine GENERATE host,
> >     REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
> > fltrdMetricLogData = FILTER metricLogData BY regex is not null;
> > eventCategories = FOREACH fltrdMetricLogData GENERATE host,
> >     FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);
> >
> > Thanks
> >
>
