I'm working on a crash-processing system and trying to group large amounts of data on multiple facets. Loading the data can be expensive, so I'd really like to use a single map job. As I understand it, multi-query execution in theory allows multiple STORE commands to be served by a single map execution.

Is there a way to EXPLAIN the plan of an entire Pig script that has multiple STORE commands, to see how it will run as MapReduce? I can only see a way to run EXPLAIN on a single relation, which shows a single MapReduce job but doesn't tell me how the jobs might be combined under multi-query execution.

I'm trying to figure out whether Pig will use a single map for the following script, or whether there is a way to make it use a single map.

raw = LOAD ...;
processed = FOREACH raw GENERATE uuid, signature, AdapterVendorID, ExtensionsInstalled, ModulesLoaded; /* UDFs process the raw data into these fields */
filtered = FILTER processed BY some conditions here;

bygraphicsvendor = GROUP filtered BY (signature, AdapterVendorID);
byvendortotals = FOREACH bygraphicsvendor GENERATE group.signature, group.AdapterVendorID, COUNT(filtered) AS c;

STORE byvendortotals INTO ....;

withextensions = FOREACH filtered GENERATE signature, flatten(ExtensionsInstalled);
byextension = GROUP withextensions BY (signature, extensionID);
byextensiontotals = FOREACH byextension GENERATE group.signature, group.extensionID, COUNT(withextensions) AS c;

STORE byextensiontotals INTO ...;
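
One thing that may be relevant: the Pig documentation for EXPLAIN describes a -script option that plans a whole script rather than a single alias. A minimal sketch from the Grunt shell, assuming the statements above are saved in a file with the hypothetical name crashes.pig and that the Pig version in use supports multi-query execution:

```pig
-- crashes.pig is a hypothetical file name holding the script above
explain -script crashes.pig;

-- same plan, written to a directory instead of the console
-- (explain_out is a hypothetical output path)
explain -script crashes.pig -out explain_out;
```

If the two STOREs are merged, I would expect them to appear under one job in the MapReduce plan, but I have not verified what the output looks like for this script.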

--BDS
