Just to be explicit, this:

x = FILTER something BY num1 > 10 AND num2 < 12;

is equivalent to this:

x = FILTER something BY num1 > 10;
x = FILTER x BY num2 < 12;

All non-blocking operators are evaluated in a streaming fashion, so you
don't need to worry about combining them into a single operator.

On Wed, Nov 2, 2011 at 10:56 AM, Ashutosh Chauhan <[email protected]> wrote:

> Hi Cameron,
>
> Your script looks alright. Each of your steps processes the data in a
> different way. Instead of cramming them together into a single statement
> (possibly via some custom UDF), it makes sense to keep them as a series
> of steps, as you have done, for better readability and debuggability.
> Are you worried about performance? You need not be. As long as your
> operations don't introduce an unnecessary map-reduce boundary (which
> your script doesn't), you are good.
>
> Hope it helps,
> Ashutosh
>
> On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[email protected]> wrote:
>
> > Hey
> >
> > I am trying to extract performance metrics from some of my logs using
> > Pig and have come up with the following. I feel like I might be
> > performing one too many steps and was wondering if there is a way to
> > reduce the number of FILTER/FOREACH operations I need to run. Still
> > trying to learn the proper syntax.
> >
> > uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY,
> >     body as body:CHARARRAY;
> > metricLogLine = FILTER uniqLogs BY
> >     (body MATCHES '.*gr.perf.metrics.Category.*');
> > metricLogData = FOREACH metricLogLine GENERATE host,
> >     REGEX_EXTRACT_ALL(body,
> >     '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)')
> >     AS regex;
> > fltrdMetricLogData = FILTER metricLogData BY regex is not null;
> > eventCategories = FOREACH fltrdMetricLogData GENERATE host,
> >     FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);
> >
> > Thanks
