Let's just say it's overly optimistic w.r.t. what actually takes time in a pig job.
D

On Wed, Nov 2, 2011 at 1:45 PM, Cameron Gandevia <[email protected]> wrote:

> In the pig documentation there is a section title Reduce your operator
> pipeline which talks about combining foreach statements as an optimization.
> It also mentions you should do the same for filter statements. Is this
> incorrect?
>
> On Wed, Nov 2, 2011 at 1:14 PM, Cameron Gandevia <[email protected]> wrote:
>
> > Cool thanks
> >
> > On Wed, Nov 2, 2011 at 1:06 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> Just to be explicit:
> >>
> >> This:
> >>
> >> x = FILTER something by num1 > 10 AND num2 < 12;
> >>
> >> is equivalent to this:
> >>
> >> x = FILTER something by num1 > 10;
> >> x = FILTER x by num2 < 12;
> >>
> >> All non-blocking operators are evaluated in a streaming fashion, so you
> >> don't need to worry about combining them into a single operator.
> >>
> >> On Wed, Nov 2, 2011 at 10:56 AM, Ashutosh Chauhan <[email protected]> wrote:
> >>
> >> > Hi Cameron,
> >> >
> >> > Your script looks alright. Each of your steps process data in different
> >> > ways. Instead of cramming together them in a single statement (possibly
> >> > via some custom UDF), it makes sense to have them in a series of steps
> >> > as you have done for better readability and debuggability. Are you
> >> > worried about performance? You need not to. As long as your operations
> >> > don't introduce a unnecessary map-reduce boundary (which your script
> >> > doesn't) you are good.
> >> >
> >> > Hope it helps,
> >> > Ashutosh
> >> >
> >> > On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[email protected]> wrote:
> >> >
> >> > > Hey
> >> > >
> >> > > I am trying to extract performance metrics from some of my logs using
> >> > > Pig and have come up with the following. I feel like I might be
> >> > > performing one too many steps and was wondering if there is a way to
> >> > > reduce the number of FILTER/FOREACH operations I need to run. Still
> >> > > trying to learn the proper syntax.
> >> > >
> >> > > uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as body:CHARARRAY;
> >> > > metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
> >> > > metricLogData = FOREACH metricLogLine GENERATE host,
> >> > >     REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
> >> > > fltrdMetricLogData = FILTER metricLogData BY regex is not null;
> >> > > eventCategories = FOREACH fltrdMetricLogData GENERATE host,
> >> > >     FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);
> >> > >
> >> > > Thanks
> >
> > --
> > Thanks
> >
> > Cameron Gandevia
>
> --
> Thanks
>
> Cameron Gandevia
