I'm working on a crash-processing system and trying to group large
amounts of data on multiple facets. Loading the data can be expensive,
so I'd really like to do it in a single map job. I understand that
multi-query execution, in theory, allows multiple STORE commands to be
fed from a single map execution. Is there a way to EXPLAIN the plan of
an entire Pig script that has multiple STORE commands, to see how it
will be run as MapReduce jobs? I can only see a way to run EXPLAIN on a
single relation, which shows a single MapReduce job but doesn't show
how the jobs might be combined under multi-query execution. I'm trying
to figure out whether Pig will use a single map for the following
script or, if not, whether there is a way to make it do so:
raw = LOAD ...;
processed = FOREACH raw GENERATE uuid, signature, AdapterVendorID,
ExtensionsInstalled, ModulesLoaded; /* UDFs process the raw data into
these fields */
filtered = FILTER processed BY some conditions here;
bygraphicsvendor = GROUP filtered BY (signature, AdapterVendorID);
byvendortotals = FOREACH bygraphicsvendor GENERATE group.signature,
group.AdapterVendorID, COUNT(filtered) AS c;
STORE byvendortotals INTO ....;
withextensions = FOREACH filtered GENERATE signature,
flatten(ExtensionsInstalled);
byextension = GROUP withextensions BY (signature, extensionID);
byextensiontotals = FOREACH byextension GENERATE group.signature,
group.extensionID, COUNT(withextensions) AS c;
STORE byextensiontotals INTO ...;
--BDS