My Pig script takes a long time (more than 10 minutes) to process fewer than
100 rows of data, even though we have 105 map and reduce nodes. I am wondering
whether we can use a FILTER clause on a grouped data set. I realized I am
filtering and grouping by the same key multiple times, which hurts
performance, so I am trying to find a way to filter the grouped data instead.
Please see below the sample code for what I want to achieve, followed by the
original code, which has multiple GROUP BY and FILTER clauses and takes a lot
of time. I am trying to eliminate the yellow colored statements. We are using
Pig version 0.5. Any input on performance optimization is greatly
appreciated.

RAW_DATA = LOAD
    '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgafmfamily/wdgafmfamily*.tsv.gz'
    USING PigStorage('\t');
FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY (int)$6 <= 0 AND (chararray)$5 == 'N';
SELECT_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE
    (long)$0 AS hit_time_gmt,
    (long)$2 AS visid_high,
    (long)$3 AS visid_low,
    (int)$9 AS mobile_id,
    (int)$17 AS page_event;
GROUP_BY_VISID_DATA = GROUP SELECT_DATA BY (visid_high, visid_low) PARALLEL 100;
METRICS_DATA = FOREACH GROUP_BY_VISID_DATA
{
    -- Filter the grouped bag (SELECT_DATA) inside the nested block,
    -- not the outer relation GROUP_BY_VISID_DATA.
    FILTER_PV_DATA = FILTER SELECT_DATA BY page_event == 0;
    FILTER_WIRELESS_PV_DATA = FILTER SELECT_DATA BY page_event == 0 AND mobile_id > 0;
    -- Count each filtered bag and give each count its own alias.
    GENERATE
        FLATTEN(group.visid_high) AS visid_high,
        FLATTEN(group.visid_low) AS visid_low,
        FLATTEN(COUNT(FILTER_PV_DATA)) AS PAGE_VIEW_COUNT,
        FLATTEN(COUNT(FILTER_WIRELESS_PV_DATA)) AS WIRELESS_PV_COUNT;
};
DUMP METRICS_DATA;
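
To make the intended result concrete, here is a small Python sketch (not Pig)
of the per-visitor counts I am after, computed in a single pass over the data.
The sample rows and the function name count_metrics are made up for
illustration only; they are not part of the actual job.

```python
from collections import defaultdict

# Hypothetical rows: (visid_high, visid_low, mobile_id, page_event)
rows = [
    (1, 10, 0, 0),  # page view, not wireless
    (1, 10, 5, 0),  # page view, wireless
    (1, 10, 5, 3),  # not a page view (page_event != 0)
    (2, 20, 0, 0),  # page view, not wireless
]

def count_metrics(rows):
    """Map (visid_high, visid_low) to [page_views, wireless_page_views]."""
    metrics = defaultdict(lambda: [0, 0])
    for visid_high, visid_low, mobile_id, page_event in rows:
        counts = metrics[(visid_high, visid_low)]
        if page_event == 0:
            counts[0] += 1          # PAGE_VIEW_COUNT
            if mobile_id > 0:
                counts[1] += 1      # WIRELESS_PV_COUNT
    return dict(metrics)

result = count_metrics(rows)
```

Each visitor key is grouped once and both counts come out of the same group,
which is the behavior I would like from the nested FILTER statements above.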

Original Code:

RAW_DATA = LOAD
    '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgafmfamily/wdgafmfamily*.tsv.gz'
    USING PigStorage('\t');
FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY (int)$6 <= 0 AND (chararray)$5 == 'N';
SELECT_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE
    (long)$0 AS hit_time_gmt,
    (long)$2 AS visid_high,
    (long)$3 AS visid_low,
    (int)$9 AS mobile_id,
    (int)$17 AS page_event;

--PV COUNT
FILTER_PV_DATA = FILTER SELECT_DATA BY page_event == 0;
SELECT_PV_DATA = FOREACH FILTER_PV_DATA GENERATE visid_high, visid_low;
GROUP_BY_VISID_SWID_DATA = GROUP SELECT_PV_DATA BY (visid_high, visid_low)
    PARALLEL 100;
PAGE_VIEWS = FOREACH GROUP_BY_VISID_SWID_DATA GENERATE
    FLATTEN(group.visid_high) AS visid_high,
    FLATTEN(group.visid_low) AS visid_low,
    FLATTEN(COUNT(SELECT_PV_DATA)) AS PAGE_VIEW_COUNT;

--WIRELESS PVS COUNT
FILTER_WIRELESS_PV_DATA = FILTER SELECT_DATA BY page_event == 0 AND mobile_id > 0;
SELECT_WIRELESS_PV_DATA = FOREACH FILTER_WIRELESS_PV_DATA GENERATE
    visid_high, visid_low;
GROUP_BY_VISID_SWID_WIRELESS_PV_DATA = GROUP SELECT_WIRELESS_PV_DATA BY
    (visid_high, visid_low) PARALLEL 100;
WIRELESS_PVS = FOREACH GROUP_BY_VISID_SWID_WIRELESS_PV_DATA GENERATE
    FLATTEN(group.visid_high) AS visid_high,
    FLATTEN(group.visid_low) AS visid_low,
    FLATTEN(COUNT(SELECT_WIRELESS_PV_DATA)) AS WIRELESS_PV_COUNT;
COGROUPED_DAILY_METRICS_DATA = COGROUP
    PAGE_VIEWS BY (visid_high, visid_low) OUTER,
    WIRELESS_PVS BY (visid_high, visid_low) OUTER;
DUMP COGROUPED_DAILY_METRICS_DATA;

Thanks
Sri
