Hey, I am trying to extract performance metrics from some of my logs using Pig and have come up with the script below. I suspect I am performing more steps than necessary. Is there a way to reduce the number of FILTER/FOREACH operations I need to run? I am still learning the proper syntax.
uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as body:CHARARRAY;
metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
metricLogData = FOREACH metricLogLine GENERATE host, REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
fltrdMetricLogData = FILTER metricLogData BY regex is not null;
eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);

Thanks
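
One reduction that might work, as an untested sketch: every body that matches the REGEX_EXTRACT_ALL pattern also matches the MATCHES pattern, so the first FILTER looks redundant as long as the null check stays in place. The sketch below also assumes that `logs` already exposes `host` and `body` as chararray fields, which would make the initial projection unnecessary; if that is not the case, the first FOREACH should stay.

```pig
-- Assumption: logs already has host and body as chararray fields.
-- REGEX_EXTRACT_ALL yields null when the pattern does not match,
-- so the null check below also drops lines without the Category marker.
metricLogData = FOREACH logs GENERATE host, REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
fltrdMetricLogData = FILTER metricLogData BY regex is not null;
eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);
```

This goes from five statements to three. The regular expression is unchanged, so the extracted category and event values should be the same as before.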
