Hi Pig community,

I am running a Pig job that uses a Python (Jython) UDF, and I am getting a failure that is hard to debug. The relevant parts of the script are:
    REGISTER [...]clustercentroid_udfs.py using jython as UDFS;

    [... definition of cluster_vals ...]

    grouped = group cluster_vals by (clusters::cluster_id, tfidf::att, clusters::block_size);

    cluster_tfidf = foreach grouped {
        generate group.clusters::cluster_id as cluster_id,
                 group.clusters::block_size as block_size,
                 group.tfidf::att as att,
                 UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
    }

    store cluster_tfidf into [...]

I can remove essentially all of the logic from UDFS.normalize_avg_words and still get the failure; for example, it fails even with this definition of normalize_avg_words():

    @outputSchema('words: {wvpairs: (word: chararray, normvalue: double)}')
    def normalize_avg_words(line):
        return []

The log for the failing task contains:

    2015-12-09 16:18:47,510 INFO [main] org.apache.pig.data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
    2015-12-09 16:18:47,534 INFO [main] org.apache.pig.scripting.jython.JythonScriptEngine: created tmp python.cachedir=/data/3/yarn/nm/usercache/sesadmin/appcache/application_1444666458457_553099/container_e17_1444666458457_553099_01_685857/tmp/pig_jython_6256288828533965407
    2015-12-09 16:18:49,443 INFO [main] org.apache.pig.scripting.jython.JythonFunction: Schema 'words: {wvpairs: (word: chararray, normvalue: double)}' defined for func normalize_avg_words
    2015-12-09 16:18:49,498 INFO [main] org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce: Aliases being processed per job phase (AliasName[line,offset]): M: grouped[87,10] C: R: cluster_tfidf[99,16]
    2015-12-09 16:18:49,511 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.rangeCheck(ArrayList.java:638)
        at java.util.ArrayList.get(ArrayList.java:414)
        at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:118)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:348)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNextTuple(POPackage.java:269)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:421)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

I do not get the failure if I use just a 5 GB segment of my full 300 GB data set.
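For reference, the stub runs without incident outside the cluster in a stand-alone harness like the one below (this is my own scaffolding, not part of the job: Pig's Jython binding normally injects outputSchema, and I am assuming bags arrive in the UDF as lists of tuples):

    # Stand-alone check of the stub UDF in plain Python, outside Pig.
    # Pig normally supplies outputSchema at runtime, so define a no-op stand-in.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

    @outputSchema('words: {wvpairs: (word: chararray, normvalue: double)}')
    def normalize_avg_words(line):
        return []

    if __name__ == '__main__':
        # Assumed input shape: a bag of (word, value) pairs as a list of tuples.
        sample_bag = [(u'word1', 0.5), (u'word2', 1.25)]
        print(normalize_avg_words(sample_bag))  # prints: []

So the function itself seems fine in isolation; whatever goes wrong appears to happen on the cluster before or around the UDF call.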
Also, I do not get the failure if I comment out the call to the UDF:

    cluster_tfidf = foreach grouped {
        generate group.clusters::cluster_id as cluster_id,
                 group.clusters::block_size as block_size,
                 group.tfidf::att as att;
                 -- UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
    }

I wonder if the failure is ultimately caused by an out-of-memory condition somewhere, but I haven't seen anything in the log that indicates that directly. (I have tried using a large number of reducers in the definition of grouped, but the result is the same.) What should I look for in the log that would be a telltale sign of out-of-memory, and how would I address it? Since I don't get the failure when the UDF call is commented out, I wonder if the problem is in the call itself, but I don't know how to diagnose or debug that. Any help would be much appreciated!

Apache Pig version 0.12.0-cdh5.3.3 (rexported)
Hadoop 2.5.0-cdh5.3.3

Thanks,
Will

William F Dowling
Senior Technologist
Thomson Reuters
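P.S. In case it is relevant to the out-of-memory question: the knobs I have been considering are the reducer heap and Pig's bag spill threshold, along the lines of the sketch below (the property names are from the Hadoop and Pig documentation, but the values are just guesses on my part):

    -- Give each reducer a larger container and heap (values are placeholders).
    SET mapreduce.reduce.memory.mb '4096';
    SET mapreduce.reduce.java.opts '-Xmx3584m';
    -- Make Pig spill bags to disk sooner (default is 0.2 of the heap).
    SET pig.cachedbag.memusage '0.1';

I have not yet confirmed whether any of these actually changes the behavior.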