Thank you. After replacing pig.jar and pig-withouthadoop.jar with the 0.11 versions from the svn trunk, my Python UDFs work like a charm.
Python UDFs are slower than the equivalent Java versions, but that's another topic.

On Fri, Jul 20, 2012 at 7:42 PM, Duckworth, Will <[email protected]> wrote:
> I think you should take a look at this ticket:
>
> https://issues.apache.org/jira/browse/PIG-2761
>
> And this thread:
>
> http://search-hadoop.com/m/gv0122Ls5N11&subj=Deserialization+error+when+using+Jython+UDF+in+Pig+0+10+script
>
> Thanks.
>
>
> Will Duckworth  Senior Vice President, Software Engineering | comScore, Inc. (NASDAQ:SCOR)
> o +1 (703) 438-2108 | m +1 (301) 606-2977 | mailto:[email protected]
> .....................................................................................................
>
> Introducing Mobile Metrix 2.0 - The next generation of mobile behavioral measurement
> www.comscore.com/MobileMetrix
>
> -----Original Message-----
> From: MiaoMiao [mailto:[email protected]]
> Sent: Friday, July 20, 2012 3:11 AM
> To: [email protected]
> Subject: Can't use python UDF in MapReduce mode
>
> Hi all,
>
> I've been using Apache Pig to do some ETL work, but I ran into a weird problem today when trying python UDFs.
>
> I borrowed an example from
> http://sundaycomputing.blogspot.com/2011/01/python-udfs-from-pig-scripts.html
>
> It worked well in local mode, but not in MapReduce mode.
>
> Since my team has already been using Pig for quite a while, it's really hard to drop it, so any help would be much appreciated.
>
> Here are my .py and .pig files, and the errors that came up.
>
> [Rufus@master1 ~] hadoop fs -cat /hdfs/testudf.txt
>
> Deepak 22 India
> Chaitanya 19 India
> Sachin 36 India
> Barack 50 USA
>
> [Rufus@master1 ~] cat pyudf.py
>
> #!/usr/bin/python
> @outputSchema("line:chararray")
> def split_into_fields(input_line):
>     return input_line
>
> [Rufus@master1 ~] cat pyudf.pig
>
> REGISTER pyudf.py USING jython AS udf;
> records = LOAD '/test/testudf.txt' using PigStorage('\n') AS (input_line:chararray);
> schema_records = FOREACH records GENERATE udf.split_into_fields(input_line);
> DUMP schema_records;
>
> local mode result:
> (Deepak 22 India)
> (Chaitanya 19 India)
> (Sachin 36 India)
> (Barack 50 USA)
>
> MapReduce mode result:
> 2012-07-20 15:09:03,322 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
> 2012-07-20 15:09:03,322 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig_1342768143321.log
> 2012-07-20 15:09:03,518 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master1
> 2012-07-20 15:09:03,568 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master1:54311
> 2012-07-20 15:09:03,630 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_7427830580471090032
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/lib/tools.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/hadoop-core-0.20.2-Intel.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-httpclient-3.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/core-3.1.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/derbyclient.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/derbytools.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-Intel.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/intel-hadoop-lzo-20110718111837.2bd0d5b.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jetty-6.1.26.patched.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.patched.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.patched.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/junit-4.5.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/pig/lib/automaton.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/pig/lib/jython-2.5.0.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/pig/pig-0.10.0-withouthadoop.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hadoop-0.20/contrib/capacity-scheduler/hadoop-capacity-scheduler-0.20.2-Intel.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/hbase/hbase-0.90.5-Intel.jar'
> *sys-package-mgr*: processing new jar, '/usr/lib/zookeeper/zookeeper-3.3.4-Intel.jar'
> *sys-package-mgr*: processing new jar, '/home/apache-hadoop/apache/hbase-0.92.1/lib/guava-r09.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/resources.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/rt.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/jsse.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/jce.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/charsets.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/ext/localedata.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/ext/dnsns.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/ext/sunpkcs11.jar'
> *sys-package-mgr*: processing new jar, '/usr/java/jdk1.6.0_29/jre/lib/ext/sunjce_provider.jar'
> 2012-07-20 15:09:07,861 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: udf.split_into_fields
> 2012-07-20 15:09:08,074 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
> 2012-07-20 15:09:08,186 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2012-07-20 15:09:08,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2012-07-20 15:09:08,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2012-07-20 15:09:08,234 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2012-07-20 15:09:08,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-07-20 15:09:08,243 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job6969787787807908978.jar
> 2012-07-20 15:09:11,320 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job6969787787807908978.jar created
> 2012-07-20 15:09:11,329 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2012-07-20 15:09:11,351 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> 2012-07-20 15:09:11,851 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2012-07-20 15:09:11,941 [Thread-7] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2012-07-20 15:09:11,941 [Thread-7] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> 2012-07-20 15:09:11,948 [Thread-7] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
> 2012-07-20 15:09:11,948 [Thread-7] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
> 2012-07-20 15:09:11,948 [Thread-7] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
> 2012-07-20 15:09:11,950 [Thread-7] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> 2012-07-20 15:09:13,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201207191034_0115
> 2012-07-20 15:09:13,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://master1:50030/jobdetails.jsp?jobid=job_201207191034_0115
> 2012-07-20 15:09:57,925 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201207191034_0115 has failed! Stop running all dependent jobs
> 2012-07-20 15:09:57,925 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2012-07-20 15:09:57,934 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: Deserialization error: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction' with arguments '[/home/Rufus/pyudf.py, split_into_fields]'
>         at org.apache.pig.impl.util.ObjectSerializer.deserialize(ObjectSerializer.java:55)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.setup(PigGenericMapBase.java:177)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> Caused by: java.lang.RuntimeException: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction' with arguments '[/home/Rufus/pyudf.p
> 2012-07-20 15:09:57,934 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> 2012-07-20 15:09:57,935 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion    PigVersion    UserId    StartedAt            FinishedAt           Features
> 0.20.2-Intel     0.10.0        root      2012-07-20 15:09:08  2012-07-20 15:09:57  UNKNOWN
>
> Failed!
>
> Failed Jobs:
> JobId                  Alias                   Feature   Message                          Outputs
> job_201207191034_0115  records,schema_records  MAP_ONLY  Message: Job failed! Error - NA  hdfs://master1/tmp/temp2070108792/tmp-630040933,
>
> Input(s):
> Failed to read data from "/hdfs/testudf.txt"
>
> Output(s):
> Failed to produce result in "hdfs://master1/tmp/temp2070108792/tmp-630040933"
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201207191034_0115
>
> 2012-07-20 15:09:57,936 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> 2012-07-20 15:09:57,971 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: Deserialization error: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction' with arguments '[/home/Rufus/pyudf.py, split_into_fields]'
>         at org.apache.pig.impl.util.ObjectSerializer.deserialize(ObjectSerializer.java:55)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.setup(PigGenericMapBase.java:177)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:656)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> Caused by: java.lang.RuntimeException: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction' with arguments '[/wtt/home/pyudf.p
> Details at logfile: /root/pig_1342768143321.log
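
One more note for anyone else following that blog example: pyudf.py only echoes each line back as a single chararray, so even with the jars fixed you don't actually get separate fields out of it. Below is a minimal sketch of what split_into_fields could look like if it really split the sample data into a typed tuple; the field names (name, age, country) and the whitespace-delimited three-column layout are my assumptions based on testudf.txt, not anything from the original script.

#!/usr/bin/python
# Sketch only: assumes each line is a whitespace-delimited name, age, country.
# The tuple schema in the decorator tells Pig how to type the returned Python tuple.
@outputSchema("record:(name:chararray, age:int, country:chararray)")
def split_into_fields(input_line):
    if input_line is None:
        return None
    parts = input_line.split()
    if len(parts) < 3:
        # Skip malformed lines instead of failing the whole map task.
        return None
    return (parts[0], int(parts[1]), parts[2])

The pyudf.pig side can stay the same; DESCRIBE schema_records should then show the tuple schema instead of a single chararray.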
