On 1 March 2011 18:02, Dan Brickley <[email protected]> wrote:
> On 1 March 2011 17:56, Dmitriy Ryaboy <[email protected]> wrote:
>> Hi Dan,
>> iirc, registering a jar does not put it on the Pig client classpath, it just
>> tells Pig to ship the jar. You want to put it on the PIG_CLASSPATH before
>> you invoke pig.
>
> Perfect, that was exactly it. It's running now :)

That'll teach me to use words like "perfect". I get to run a job now, however ...

2011-03-01 19:25:54,167 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2011-03-01 19:25:54,167 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2011-03-01 19:25:54,235 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for tw06: $0, $1
2011-03-01 19:25:54,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: urls: Store(hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477:org.apache.pig.impl.io.InterStorage) - scope-27 Operator Key: scope-27)
2011-03-01 19:25:54,242 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2011-03-01 19:25:54,243 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2011-03-01 19:25:54,243 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2011-03-01 19:25:54,247 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2011-03-01 19:25:54,247 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2011-03-01 19:25:56,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2011-03-01 19:25:56,261 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2011-03-01 19:25:56,508 [Thread-19] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2011-03-01 19:25:56,509 [Thread-19] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2011-03-01 19:25:56,516 [Thread-19] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2011-03-01 19:25:56,763 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2011-03-01 19:25:57,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201102272217_0017
2011-03-01 19:25:57,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://node1.hdfs-hadoop.sara.nl:50030/jobdetails.jsp?jobid=job_201102272217_0017
2011-03-01 19:26:52,239 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201102272217_0017 has failed! Stop running all dependent jobs
2011-03-01 19:26:52,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2011-03-01 19:26:52,252 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: Deserialization error: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
2011-03-01 19:26:52,252 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2011-03-01 19:26:52,253 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2-CDH3B4 0.8.0-CDH3B4 danbri 2011-03-01 19:25:54 2011-03-01 19:26:52 FILTER

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_201102272217_0017 tw06,urls,x MAP_ONLY Message: Job failed! Error - NA hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477,

Input(s):
Failed to read data from "/user/danbri/twitter/tweets2009-06.tab.txt.lzo"

Output(s):
Failed to produce result in "hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201102272217_0017

2011-03-01 19:26:52,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2011-03-01 19:26:52,298 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: Deserialization error: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
Details at logfile: /home/danbri/twitter/pig_1299003251377.log

Looking there, it seems something (not afaik the twitter-text nor my library) wants some Google Collections class.

/home/danbri/twitter/pig_1299003251377.log ->

java.io.IOException: Deserialization error: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
Caused by: java.lang.RuntimeException: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.lang.NoClassDefFoundError: com/google/common/collect/Sets
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Sets

Perhaps this is something wrong in our Pig setup. I'll get a .jar from http://code.google.com/p/google-collections/source/checkout and see if that'll fix it.

[...]

... nope, added ./google-collect-snapshot.jar to PIG_CLASSPATH, but I'm getting the same behaviour. Investigating...

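One way to narrow this down is to ask a JVM directly which of the classes involved it can see, and from which jar. The sketch below is only an illustration and not part of Pig: `ClasspathCheck` and `locate` are hypothetical names, and it reports only on the classpath of the JVM it runs in (for example the Pig client), not on what the map tasks on the cluster can load.

```java
import java.security.CodeSource;

final class ClasspathCheck {

    /** Returns the location a class would be loaded from, or null if it cannot be loaded. */
    public static String locate(String className) {
        try {
            // initialize=false: we only care whether the class is visible, not about running it
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes usually have no code source
            return src == null ? "(JDK / bootstrap)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Default to the class names that appear in the stack traces in this thread
        String[] names = args.length > 0 ? args : new String[] {
            "tv.notube.TwitterExtractor",
            "com.twitter.Extractor",
            "com.google.common.collect.Sets"
        };
        for (String name : names) {
            String where = locate(name);
            System.out.println(name + " -> " + (where == null ? "NOT FOUND" : where));
        }
    }
}
```

It would be run with the same entries that PIG_CLASSPATH holds, e.g. `java -cp .:twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar:google-collect-snapshot.jar ClasspathCheck`. Since ERROR 2997 above is reported from the backend, a class that resolves here may still be missing on the task nodes. Going by Dmitriy's note that REGISTER is what ships a jar, the Google Collections jar may need its own REGISTER line as well, though that is an inference and untested here.
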
Dan

>> On Tue, Mar 1, 2011 at 5:57 AM, Dan Brickley <[email protected]> wrote:
>>>
>>> I'm trying to use InvokeForString to call a simple static method that
>>> wraps http://mzsanford.github.com/twitter-text-java/docs/api/index.html
>>> https://github.com/twitter/twitter-text-java ... specifically the
>>> Extractor class extractURLs method. In fact since the logical result
>>> is a list of URLs perhaps I should be writing proper Pig-centric
>>> wrapper that returns a tuple, but for now I thought a stringified list
>>> would be ok for my immediate purposes. That purpose being pulling out
>>> all the URLs from a corpus of tweets, so we can expand the bit.ly and
>>> other short urls...
>>>
>>> So - I built the extra class (src below) and packaged it inside the
>>> twitter-text jar, and verify it's in there and usable as follows:
>>>
>>> danbri$ java -cp
>>> twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar
>>> tv.notube.TwitterExtractor "hello http://example.com/
>>> http://example.org/ world"
>>> URLs: [http://example.com/, http://example.org/]
>>>
>>> Then from the same directory, I try run this as a Pig job:
>>>
>>> tw06 = load '/user/danbri/twitter/tweets2009-06.tab.txt.lzo' AS (
>>> when: chararray, who: chararray, msg: chararray);
>>> REGISTER twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar;
>>> DEFINE ExtractURLs InvokeForString('tv.notube.TwitterExtractor.urls',
>>> 'String');
>>> urls = FOREACH tw06 GENERATE ExtractURLs(msg);
>>> x = SAMPLE urls 0.001;
>>> dump x;
>>>
>>> ...but we don't get past InvokeForString,
>>>
>>> 2011-03-01 14:50:31,033 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>> - ERROR 1000: Error during parsing. could not instantiate
>>> 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
>>> String]'
>>> Details at logfile: /home/danbri/twitter/pig_1298987430385.log
>>> ...->
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> Caused by: java.lang.ClassNotFoundException: tv.notube.TwitterExtractor
>>>
>>> I checked that Pig is finding the jar by mis-spelling the filename in
>>> the "REGISTER" line (which as expected causes things to fail earlier).
>>> Also double-check that the class is in the jar,
>>> danbri$ jar -tvf
>>> twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar | grep tv
>>> 0 Tue Mar 01 12:03:04 CET 2011 tv/
>>> 0 Tue Mar 01 12:03:04 CET 2011 tv/notube/
>>> 1114 Tue Mar 01 13:40:30 CET 2011 tv/notube/TwitterExtractor.class
>>>
>>> ...so I'm finding myself stuck. I'm sure the answer is staring me in
>>> the face, but I can't see it. Perhaps I should just do things properly
>>> with "extends EvalFunc<String>" and return the tuples separately
>>> anyway...
>>>
>>> Thanks for any pointers,
>>>
>>> Dan
>>>
>>>
>>> package tv.notube;
>>> import com.twitter.Extractor;
>>> import java.util.List;
>>> class TwitterExtractor {
>>>
>>>   public static void main (String[] args) {
>>>     String in = args[0];
>>>     System.out.println("URLs: " + urls(in));
>>>   }
>>>
>>>   public static String urls(String tweet) {
>>>     Extractor ex = new Extractor();
>>>     List urls = ex.extractURLs(tweet);
>>>     String o = urls.toString();
>>>     return o;
>>>   }
>>> }
>>
>>
>
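
The ClassNotFoundException in the quoted message is consistent with a reflective lookup running in a JVM that does not have the named class on its classpath. As a rough illustration of that kind of lookup (a sketch only, not Pig's actual InvokeForString code; `StaticInvoker` and `invokeForString` are hypothetical names), the following resolves a "package.Class.method" string to a static method taking one String and calls it:

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

final class StaticInvoker {

    /**
     * Calls a public static method, named as "package.Class.method", that takes
     * a single String, and returns its result as a String.
     */
    public static String invokeForString(String fullName, String arg) {
        int dot = fullName.lastIndexOf('.');
        if (dot <= 0 || dot == fullName.length() - 1) {
            throw new IllegalArgumentException("expected package.Class.method: " + fullName);
        }
        String className = fullName.substring(0, dot);
        String methodName = fullName.substring(dot + 1);
        try {
            // Throws ClassNotFoundException if the class is not on the
            // classpath of the JVM running this code.
            Class<?> cls = Class.forName(className);
            Method m = cls.getMethod(methodName, String.class);
            if (!Modifier.isStatic(m.getModifiers())) {
                throw new IllegalArgumentException(fullName + " is not static");
            }
            Object result = m.invoke(null, arg);
            return result == null ? null : result.toString();
        } catch (ReflectiveOperationException e) {
            // Wrap the reflective failure, keeping the original as the cause
            throw new IllegalStateException("could not invoke " + fullName, e);
        }
    }

    public static void main(String[] args) {
        // Pattern.quote is a JDK static String -> String method, used here as a stand-in
        System.out.println(invokeForString("java.util.regex.Pattern.quote", "a.b"));
        try {
            invokeForString("tv.notube.TwitterExtractor.urls", "hello http://example.com/");
        } catch (IllegalStateException e) {
            System.out.println("lookup failed: " + e.getCause());
        }
    }
}
```

Run without the twitter-text jar on the classpath, the second call fails with a ClassNotFoundException as its cause; with the jar added via `-cp`, the lookup gets further. In this sketch a class declared without `public`, like the quoted `class TwitterExtractor`, would still be rejected with an IllegalAccessException when called from another package, so declaring it `public` may be worth trying too.
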
