In the Pig documentation there is a section titled "Reduce Your Operator Pipeline" which talks about combining FOREACH statements as an optimization. It also mentions you should do the same for FILTER statements. Is this incorrect?
On Wed, Nov 2, 2011 at 1:14 PM, Cameron Gandevia <[email protected]> wrote:

> Cool thanks
>
> On Wed, Nov 2, 2011 at 1:06 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Just to be explicit:
>>
>> This:
>>
>> x = FILTER something by num1 > 10 AND num2 < 12;
>>
>> is equivalent to this:
>>
>> x = FILTER something by num1 > 10;
>> x = FILTER x by num2 < 12;
>>
>> All non-blocking operators are evaluated in a streaming fashion, so you
>> don't need to worry about combining them into a single operator.
>>
>> On Wed, Nov 2, 2011 at 10:56 AM, Ashutosh Chauhan <[email protected]> wrote:
>>
>> > Hi Cameron,
>> >
>> > Your script looks alright. Each of your steps processes data in a
>> > different way. Instead of cramming them together in a single statement
>> > (possibly via some custom UDF), it makes sense to have them in a series
>> > of steps, as you have done, for better readability and debuggability.
>> > Are you worried about performance? You need not be. As long as your
>> > operations don't introduce an unnecessary map-reduce boundary (which
>> > your script doesn't), you are good.
>> >
>> > Hope it helps,
>> > Ashutosh
>> >
>> > On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[email protected]> wrote:
>> >
>> > > Hey
>> > >
>> > > I am trying to extract performance metrics from some of my logs using
>> > > Pig and have come up with the following. I feel like I might be
>> > > performing one too many steps and was wondering if there is a way to
>> > > reduce the number of FILTER/FOREACH operations I need to run. Still
>> > > trying to learn the proper syntax.
>> > >
>> > > uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as body:CHARARRAY;
>> > > metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
>> > > metricLogData = FOREACH metricLogLine GENERATE host,
>> > >     REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
>> > > fltrdMetricLogData = FILTER metricLogData BY regex is not null;
>> > > eventCategories = FOREACH fltrdMetricLogData GENERATE host,
>> > >     FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);
>> > >
>> > > Thanks
>
> --
> Thanks
>
> Cameron Gandevia

--
Thanks
Cameron Gandevia
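
The streaming point quoted above (that one combined FILTER and two chained FILTERs are equivalent) can be illustrated outside Pig. The following is a minimal Python sketch, not Pig code and not a description of Pig's internals: generators stand in for non-blocking operators, and the names `pig_filter`, `counting_source`, and the sample rows are hypothetical, introduced only for this illustration.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, int]


def pig_filter(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Hypothetical stand-in for a non-blocking FILTER: yields rows one at a time."""
    for row in rows:
        if predicate(row):
            yield row


def counting_source(rows: Iterable[Row], counter: Dict[str, int]) -> Iterator[Row]:
    """Yield rows while counting how many times the input is read."""
    for row in rows:
        counter["reads"] += 1
        yield row


def combined(rows: Iterable[Row]) -> Iterator[Row]:
    # Models: x = FILTER something by num1 > 10 AND num2 < 12;
    return pig_filter(rows, lambda r: r["num1"] > 10 and r["num2"] < 12)


def chained(rows: Iterable[Row]) -> Iterator[Row]:
    # Models: x = FILTER something by num1 > 10;
    #         x = FILTER x by num2 < 12;
    x = pig_filter(rows, lambda r: r["num1"] > 10)
    return pig_filter(x, lambda r: r["num2"] < 12)


# Made-up sample data, for illustration only.
sample = [
    {"num1": 5, "num2": 3},
    {"num1": 11, "num2": 11},
    {"num1": 20, "num2": 15},
    {"num1": 42, "num2": 0},
]

combined_reads = {"reads": 0}
chained_reads = {"reads": 0}

combined_result = list(combined(counting_source(sample, combined_reads)))
chained_result = list(chained(counting_source(sample, chained_reads)))

print(combined_result == chained_result)
print(combined_reads["reads"], chained_reads["reads"])
```

In this model both versions return the same rows and read each input row exactly once, because the second filter pulls rows through the first as they are produced rather than waiting for an intermediate result. Whether a real Pig script compiles to a single pass is best checked by inspecting its plan rather than inferred from this sketch.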
