Hi Cameron,

Your script looks fine. Each of your steps processes the data in a different way. Instead of cramming them together into a single statement (possibly via some custom UDF), it makes sense to keep them as a series of steps, as you have done, for better readability and debuggability.

Are you worried about performance? You need not be. As long as your operations don't introduce an unnecessary map-reduce boundary (which your script doesn't), you are good.
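If you want to check the map-reduce boundary point yourself, here is a minimal sketch using Pig's EXPLAIN operator on the last alias of your script. The LOAD line, the file name 'logs.tsv' and its schema are assumptions made only so the snippet is complete; substitute whatever loader you actually use. The rest is your script unchanged.

```pig
-- Hypothetical input: path and schema are assumed, replace with your own loader.
logs = LOAD 'logs.tsv' AS (host:chararray, body:chararray);

-- The script from the original message, unchanged.
uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as body:CHARARRAY;
metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
metricLogData = FOREACH metricLogLine GENERATE host,
    REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
fltrdMetricLogData = FILTER metricLogData BY regex is not null;
eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex)
    AS (category:CHARARRAY, event:CHARARRAY);

-- Print the logical, physical and map-reduce plans for the final alias.
EXPLAIN eventCategories;
```

Since the script only uses FILTER and FOREACH, the map-reduce plan in the output should show a single job with no reduce phase. Operators such as GROUP, JOIN, ORDER BY or DISTINCT are the ones that would add a reduce step.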
Hope it helps,
Ashutosh

On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[email protected]> wrote:
> Hey
>
> I am trying to extract performance metrics from some of my logs using Pig
> and have come up with the following. I feel like I might be performing one
> too many steps and was wondering if there is a way to reduce the number of
> FILTER/FOREACH operations I need to run. Still trying to learn the proper
> syntax.
>
> uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as
> body:CHARARRAY;
> metricLogLine = FILTER uniqLogs BY (body MATCHES
> '.*gr.perf.metrics.Category.*');
> metricLogData = FOREACH metricLogLine GENERATE host,
> REGEX_EXTRACT_ALL(body,
>
> '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)')
> AS regex;
> fltrdMetricLogData = FILTER metricLogData BY regex is not null;
> eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex)
> AS (category:CHARARRAY, event:CHARARRAY);
>
> Thanks
>
