Performing multiple reductions from a single map job

Benjamin Smedberg Mon, 11 Mar 2013 08:30:08 -0700

I'm working on a crash processing system and trying to group large amounts of data on multiple facets. Loading the data can be expensive, so I'd really like to use a single map job. I understand that multi-query execution in theory allows multiple STORE commands to come from a single map execution. Is there a way to EXPLAIN the plan of an entire Pig script that has multiple STORE commands, to tell how it's going to run mapreduce? I can only see a way to run EXPLAIN on a single relation, which shows a single mapreduce job but doesn't really tell how the jobs might be combined with multi-query execution. I'm trying to figure out whether Pig will use a single map for the following script, or whether there is a way to make it use a single map.


raw = LOAD ...;

processed = FOREACH raw GENERATE uuid, signature, AdapterVendorID, ExtensionsInstalled, ModulesLoaded; /* UDFs process the raw data into these fields */

filtered = FILTER processed BY some conditions here;


bygraphicsvendor = GROUP filtered BY (signature, AdapterVendorID);

byvendortotals = FOREACH bygraphicsvendor GENERATE group.signature, group.AdapterVendorID, COUNT(filtered) AS c;


STORE byvendortotals INTO ....;

withextensions = FOREACH filtered GENERATE signature, flatten(ExtensionsInstalled);

byextension = GROUP withextensions BY (signature, extensionID);

byextensiontotals = FOREACH byextension GENERATE group.signature, group.extensionID, COUNT(withextensions) AS c;


STORE byextensiontotals INTO ...;

--BDS
