Let's just say it's overly optimistic w.r.t. what actually takes time in a
Pig job.

D

On Wed, Nov 2, 2011 at 1:45 PM, Cameron Gandevia <[email protected]> wrote:

> In the Pig documentation there is a section titled "Reduce your operator
> pipeline" which talks about combining FOREACH statements as an optimization.
> It also mentions you should do the same for FILTER statements. Is this
> incorrect?
>
> On Wed, Nov 2, 2011 at 1:14 PM, Cameron Gandevia <[email protected]> wrote:
>
> > Cool thanks
> >
> >
> > On Wed, Nov 2, 2011 at 1:06 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> Just to be explicit:
> >>
> >> This:
> >>
> >> x = FILTER something by num1 > 10 AND num2 < 12;
> >>
> >> is equivalent to this:
> >>
> >> x = FILTER something by num1 > 10;
> >> x = FILTER x by num2 < 12;
> >>
> >> All non-blocking operators are evaluated in a streaming fashion, so you
> >> don't need to worry about combining them into a single operator.
> >>
> >> On Wed, Nov 2, 2011 at 10:56 AM, Ashutosh Chauhan <[email protected]> wrote:
> >>
> >> > Hi Cameron,
> >> >
> >> > Your script looks alright. Each of your steps processes the data in a
> >> > different way. Instead of cramming them together into a single statement
> >> > (possibly via some custom UDF), it makes sense to have them as a series of
> >> > steps, as you have done, for better readability and debuggability. Are you
> >> > worried about performance? You need not be. As long as your operations
> >> > don't introduce an unnecessary map-reduce boundary (which your script
> >> > doesn't), you are good.
> >> >
> >> > Hope it helps,
> >> > Ashutosh
> >> >
> >> > On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[email protected]> wrote:
> >> >
> >> > > Hey
> >> > >
> >> > > I am trying to extract performance metrics from some of my logs using
> >> > > Pig and have come up with the following. I feel like I might be
> >> > > performing one too many steps and was wondering if there is a way to
> >> > > reduce the number of FILTER/FOREACH operations I need to run. Still
> >> > > trying to learn the proper syntax.
> >> > >
> >> > > uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as body:CHARARRAY;
> >> > > metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
> >> > > metricLogData = FOREACH metricLogLine GENERATE host,
> >> > >     REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)')
> >> > >     AS regex;
> >> > > fltrdMetricLogData = FILTER metricLogData BY regex is not null;
> >> > > eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex)
> >> > >     AS (category:CHARARRAY, event:CHARARRAY);
> >> > >
> >> > > Thanks
> >
> > --
> > Thanks
> >
> > Cameron Gandevia
>
> --
> Thanks
>
> Cameron Gandevia
>
