I have an input of aliases with many identifying attributes per alias. There are on the order of 1E8 aliases and 1E5 distinct attributes. I am trying to generate a network of commutative alias-alias pairings in which the two aliases share at least one attribute. The vast majority of attributes have a relatively small number of corresponding aliases (~1E3), but a few (<1% of attributes) have corresponding aliases on the order of the entire input alias set (~1E8).
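To make the pairing concrete, here is a minimal Python sketch of the same idea on a handful of toy rows. It is not the Pig job itself. The function name `alias_pairs` and the in-memory grouping are assumptions made purely for illustration. It groups aliases by attribute, emits every ordered pair of distinct aliases that share an attribute, and records how many ordered pairs each attribute contributes.

```python
from collections import defaultdict
from itertools import permutations


def alias_pairs(rows):
    """Return (pairs, per_attr) for an iterable of (alias, attribute) rows.

    pairs    -- set of ordered (alias, alias) tuples sharing >= 1 attribute
    per_attr -- dict mapping attribute -> number of ordered pairs it emits
    """
    # Group the distinct aliases seen for each attribute.
    by_attr = defaultdict(set)
    for alias, attr in rows:
        by_attr[attr].add(alias)

    pairs = set()
    per_attr = {}
    for attr, aliases in by_attr.items():
        n = len(aliases)
        # A self-join on one attribute yields n * (n - 1) ordered pairs
        # once the alias != alias rows are filtered out.
        per_attr[attr] = n * (n - 1)
        # The set deduplicates pairs that arise from several attributes.
        pairs.update(permutations(sorted(aliases), 2))
    return pairs, per_attr


# Toy rows, hypothetical and tiny compared with the real input.
rows = [
    ("aa", "cat"), ("aa", "dog"),
    ("bb", "dog"), ("bb", "bear"),
    ("cc", "cat"), ("dd", "bird"),
]
pairs, per_attr = alias_pairs(rows)
print(sorted(pairs))
print(per_attr)
```

The per-attribute count is the point of the sketch: an attribute shared by n aliases contributes n(n-1) ordered pairs, so one with n around 1E8 would imply roughly 1E16 joined rows for that single key, while an attribute with n around 1E3 yields about 1E6.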
I am running into an issue with these large (<1%) attributes. The reducers for some of these tasks take orders of magnitude longer to complete than the other 99% (many hours versus minutes). A representation of the script is below (Pig 0.11.2):

    SET default_parallel $REDUCERS;
    SET pig.schematuple true;
    SET pig.exec.mapPartAgg true;
    SET output.compression.enabled true;
    SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

    X = LOAD '$INPUT/user_item' USING PigStorage() AS (alias:chararray, attributeURI:chararray);
    A1 = FOREACH X GENERATE *;
    A2 = FOREACH X GENERATE *;
    A3 = JOIN A1 BY (attributeURI), A2 BY (attributeURI);
    A4 = FILTER A3 BY (A1::alias != A2::alias);
    A5 = FOREACH A4 GENERATE A1::alias, A2::alias; -- projection because X contains other fields not shown here
    A6 = DISTINCT A5;
    STORE A6 INTO '$OUTPUT/network' USING PigStorage();

Here, the reducer steps A4 and A5 take forever on a handful of reducer tasks, likely because of the <1% of attributes described above. Is there a better way to optimize this script?

An example of the input X:

    aa, cat
    aa, dog
    bb, dog
    bb, bear
    cc, cat
    dd, bird

An example of the output A6:

    aa, bb
    aa, cc
    bb, aa
    cc, aa

Many thanks.
-Dan
