My Pig script takes a long time (more than 10 minutes) to process fewer than
100 rows of data, and we have 105 map and reduce nodes. I am wondering whether
a FILTER clause can be used on a grouped data set. I realized I am filtering
and grouping by the same key multiple times, which hurts performance, so I am
trying to find a way to filter the data after it has been grouped. Below is
sample code for what I want to achieve, followed by the original code, which
has multiple GROUP BY and FILTER clauses and takes a lot of time. I am trying
to eliminate the statements highlighted in yellow. We are using Pig version
0.5. Any input on performance optimization is greatly appreciated.
RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgafmfamily/wdgafmfamily*.tsv.gz'
    USING PigStorage('\t');
FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY (int)$6 <= 0 AND (chararray)$5 == 'N';
SELECT_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE
    (long)$0 AS hit_time_gmt,
    (long)$2 AS visid_high,
    (long)$3 AS visid_low,
    (int)$9 AS mobile_id,
    (int)$17 AS page_event;
GROUP_BY_VISID_DATA = GROUP SELECT_DATA BY (visid_high, visid_low) PARALLEL 100;
METRICS_DATA = FOREACH GROUP_BY_VISID_DATA
{
    -- Filter the bag of grouped rows (SELECT_DATA), not the outer relation.
    FILTER_PV_DATA = FILTER SELECT_DATA BY page_event == 0;
    FILTER_WIRELESS_PV_DATA = FILTER SELECT_DATA BY page_event == 0 AND mobile_id > 0;
    -- Count each filtered bag and give each count its own alias.
    GENERATE
        FLATTEN(group.visid_high) AS visid_high,
        FLATTEN(group.visid_low) AS visid_low,
        FLATTEN(COUNT(FILTER_PV_DATA)) AS PAGE_VIEW_COUNT,
        FLATTEN(COUNT(FILTER_WIRELESS_PV_DATA)) AS WIRELESS_PV_COUNT;
};
DUMP METRICS_DATA;
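An alternative that avoids nested FILTER altogether is conditional
aggregation: flag each row before grouping, then sum the flags per visitor so
that a single GROUP produces both counts. The following is an untested sketch,
assuming the bincond operator and SUM over a projected bag column behave in
Pig 0.5 as they do in later releases. FLAGGED_DATA, GROUP_BY_VISID_FLAGS,
METRICS_BY_FLAGS, is_pv and is_wireless_pv are hypothetical names that do not
appear in either script here.

```pig
-- Sketch only: SELECT_DATA and its fields come from the script above;
-- is_pv and is_wireless_pv are hypothetical helper columns.
FLAGGED_DATA = FOREACH SELECT_DATA GENERATE
    visid_high,
    visid_low,
    -- 1 when the row is a page view, 0 otherwise
    (page_event == 0 ? 1 : 0) AS is_pv,
    -- 1 when the row is a page view from a mobile device, 0 otherwise
    ((page_event == 0 AND mobile_id > 0) ? 1 : 0) AS is_wireless_pv;

-- One GROUP on the visitor key serves both metrics.
GROUP_BY_VISID_FLAGS = GROUP FLAGGED_DATA BY (visid_high, visid_low) PARALLEL 100;

-- Summing the 0/1 flags gives the per-visitor counts.
METRICS_BY_FLAGS = FOREACH GROUP_BY_VISID_FLAGS GENERATE
    FLATTEN(group) AS (visid_high, visid_low),
    SUM(FLAGGED_DATA.is_pv) AS PAGE_VIEW_COUNT,
    SUM(FLAGGED_DATA.is_wireless_pv) AS WIRELESS_PV_COUNT;
```

Unlike the original two-pass script, a visitor with no page views would still
appear in this output, with a count of 0.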
Original Code:
RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgafmfamily/wdgafmfamily*.tsv.gz'
    USING PigStorage('\t');
FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY (int)$6 <= 0 AND (chararray)$5 == 'N';
SELECT_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE
    (long)$0 AS hit_time_gmt,
    (long)$2 AS visid_high,
    (long)$3 AS visid_low,
    (int)$9 AS mobile_id,
    (int)$17 AS page_event;
--PV COUNT
FILTER_PV_DATA = FILTER SELECT_DATA BY page_event == 0;
SELECT_PV_DATA = FOREACH FILTER_PV_DATA GENERATE visid_high, visid_low;
GROUP_BY_VISID_SWID_DATA = GROUP SELECT_PV_DATA BY (visid_high, visid_low) PARALLEL 100;
PAGE_VIEWS = FOREACH GROUP_BY_VISID_SWID_DATA GENERATE
    FLATTEN(group.visid_high) AS visid_high,
    FLATTEN(group.visid_low) AS visid_low,
    FLATTEN(COUNT(SELECT_PV_DATA)) AS PAGE_VIEW_COUNT;
--WIRELESS PVS COUNT
FILTER_WIRELESS_PV_DATA = FILTER SELECT_DATA BY page_event == 0 AND mobile_id > 0;
SELECT_WIRELESS_PV_DATA = FOREACH FILTER_WIRELESS_PV_DATA GENERATE visid_high, visid_low;
GROUP_BY_VISID_SWID_WIRELESS_PV_DATA = GROUP SELECT_WIRELESS_PV_DATA
    BY (visid_high, visid_low) PARALLEL 100;
WIRELESS_PVS = FOREACH GROUP_BY_VISID_SWID_WIRELESS_PV_DATA GENERATE
    FLATTEN(group.visid_high) AS visid_high,
    FLATTEN(group.visid_low) AS visid_low,
    FLATTEN(COUNT(SELECT_WIRELESS_PV_DATA)) AS WIRELESS_PV_COUNT;
COGROUPED_DAILY_METRICS_DATA = COGROUP
    PAGE_VIEWS BY (visid_high, visid_low) OUTER,
    WIRELESS_PVS BY (visid_high, visid_low) OUTER;
DUMP COGROUPED_DAILY_METRICS_DATA;
Thanks
Sri