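Argh. Try doing both -- add google-collections to PIG_CLASSPATH, *and* REGISTER it in the script. The classpath entry covers the Pig client; REGISTER is what gets the jar shipped to the tasks. Same with twitter-text.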
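
If it helps narrow things down, here is a rough sketch of a probe you could run with the same classpath you hand to Pig, to see which of the three classes from your stack traces the client JVM can actually load. ClasspathProbe is a made-up name, not part of Pig or twitter-text, and it only tells you about the local classpath, not about what gets shipped to the map tasks.

```java
// Hypothetical helper (not part of Pig or twitter-text): reports whether
// classes can be loaded from the classpath this JVM was started with.
public class ClasspathProbe {

    // Returns true if the named class can be located and loaded.
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: load the class without running its static initializers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError from a missing dependency
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class names that appear in the stack traces in this thread.
        String[] names = args.length > 0 ? args : new String[] {
            "com.google.common.collect.Sets",
            "com.twitter.Extractor",
            "tv.notube.TwitterExtractor"
        };
        for (String name : names) {
            System.out.println((isLoadable(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Compile it and run it as something like java -cp "$PIG_CLASSPATH:." ClasspathProbe. Anything reported as MISSING is not visible to the client side; anything found there still has to be REGISTERed to reach the backend.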

D

On Tue, Mar 1, 2011 at 10:56 AM, Dan Brickley <[email protected]> wrote:

> On 1 March 2011 18:02, Dan Brickley <[email protected]> wrote:
> > On 1 March 2011 17:56, Dmitriy Ryaboy <[email protected]> wrote:
> >> Hi Dan,
> >> iirc, registering a jar does not put it on the Pig client classpath, it just
> >> tells Pig to ship the jar. You want to put it on the PIG_CLASSPATH before
> >> you invoke pig.
> >
> > Perfect, that was exactly it. It's running now :)
>
> That'll teach me to use words like "perfect". I get to run a job now,
> however ...
>
> 2011-03-01 19:25:54,167 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
> 2011-03-01 19:25:54,167 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2011-03-01 19:25:54,235 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for tw06: $0, $1
> 2011-03-01 19:25:54,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: urls: Store(hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477:org.apache.pig.impl.io.InterStorage) - scope-27 Operator Key: scope-27)
> 2011-03-01 19:25:54,242 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2011-03-01 19:25:54,243 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2011-03-01 19:25:54,243 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> 2011-03-01 19:25:54,247 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> 2011-03-01 19:25:54,247 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2011-03-01 19:25:56,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> 2011-03-01 19:25:56,261 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> 2011-03-01 19:25:56,508 [Thread-19] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
> 2011-03-01 19:25:56,509 [Thread-19] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
> 2011-03-01 19:25:56,516 [Thread-19] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> 2011-03-01 19:25:56,763 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2011-03-01 19:25:57,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201102272217_0017
> 2011-03-01 19:25:57,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://node1.hdfs-hadoop.sara.nl:50030/jobdetails.jsp?jobid=job_201102272217_0017
> 2011-03-01 19:26:52,239 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201102272217_0017 has failed! Stop running all dependent jobs
> 2011-03-01 19:26:52,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2011-03-01 19:26:52,252 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: Deserialization error: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
> 2011-03-01 19:26:52,252 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> 2011-03-01 19:26:52,253 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
>
> HadoopVersion  PigVersion    UserId  StartedAt            FinishedAt           Features
> 0.20.2-CDH3B4  0.8.0-CDH3B4  danbri  2011-03-01 19:25:54  2011-03-01 19:26:52  FILTER
>
> Failed!
>
> Failed Jobs:
> JobId                  Alias        Feature   Message                          Outputs
> job_201102272217_0017  tw06,urls,x  MAP_ONLY  Message: Job failed! Error - NA  hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477,
>
> Input(s):
> Failed to read data from "/user/danbri/twitter/tweets2009-06.tab.txt.lzo"
>
> Output(s):
> Failed to produce result in "hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477"
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201102272217_0017
>
>
> 2011-03-01 19:26:52,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> 2011-03-01 19:26:52,298 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: java.io.IOException: Deserialization error: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
> Details at logfile: /home/danbri/twitter/pig_1299003251377.log
>
> Looking there, it seems something (not afaik the twitter-text nor my
> library) wants some Google Collections class.
>
> /home/danbri/twitter/pig_1299003251377.log
> ->
> java.io.IOException: Deserialization error: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
> Caused by: java.lang.RuntimeException: could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
> Caused by: java.lang.reflect.InvocationTargetException
> Caused by: java.lang.NoClassDefFoundError: com/google/common/collect/Sets
> Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Sets
>
> Perhaps this is something wrong in our Pig setup. I'll get a .jar from
> http://code.google.com/p/google-collections/source/checkout and see if
> that'll fix it. [...] ... nope, added ./google-collect-snapshot.jar to
> PIG_CLASSPATH, but I'm getting the same behaviour.
>
> Investigating...
>
> Dan
>
> >> On Tue, Mar 1, 2011 at 5:57 AM, Dan Brickley <[email protected]> wrote:
> >>>
> >>> I'm trying to use InvokeForString to call a simple static method that
> >>> wraps http://mzsanford.github.com/twitter-text-java/docs/api/index.html
> >>> https://github.com/twitter/twitter-text-java ... specifically the
> >>> Extractor class extractURLs method. In fact since the logical result
> >>> is a list of URLs perhaps I should be writing proper Pig-centric
> >>> wrapper that returns a tuple, but for now I thought a stringified list
> >>> would be ok for my immediate purposes. That purpose being pulling out
> >>> all the URLs from a corpus of tweets, so we can expand the bit.ly and
> >>> other short urls...
> >>>
> >>> So - I built the extra class (src below) and packaged it inside the
> >>> twitter-text jar, and verify it's in there and usable as follows:
> >>>
> >>> danbri$ java -cp twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar tv.notube.TwitterExtractor "hello http://example.com/ http://example.org/ world"
> >>> URLs: [http://example.com/, http://example.org/]
> >>>
> >>> Then from the same directory, I try run this as a Pig job:
> >>>
> >>> tw06 = load '/user/danbri/twitter/tweets2009-06.tab.txt.lzo' AS (when: chararray, who: chararray, msg: chararray);
> >>> REGISTER twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar;
> >>> DEFINE ExtractURLs InvokeForString('tv.notube.TwitterExtractor.urls', 'String');
> >>> urls = FOREACH tw06 GENERATE ExtractURLs(msg);
> >>> x = SAMPLE urls 0.001;
> >>> dump x;
> >>>
> >>> ...but we don't get past InvokeForString,
> >>>
> >>> 2011-03-01 14:50:31,033 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
> >>> Details at logfile: /home/danbri/twitter/pig_1298987430385.log
> >>> ...->
> >>> Caused by: java.lang.reflect.InvocationTargetException
> >>> Caused by: java.lang.ClassNotFoundException: tv.notube.TwitterExtractor
> >>>
> >>> I checked that Pig is finding the jar by mis-spelling the filename in
> >>> the "REGISTER" line (which as expected causes things to fail earlier).
> >>> Also double-check that the class is in the jar,
> >>> danbri$ jar -tvf twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar | grep tv
> >>>      0 Tue Mar 01 12:03:04 CET 2011 tv/
> >>>      0 Tue Mar 01 12:03:04 CET 2011 tv/notube/
> >>>   1114 Tue Mar 01 13:40:30 CET 2011 tv/notube/TwitterExtractor.class
> >>>
> >>> ...so I'm finding myself stuck. I'm sure the answer is staring me in
> >>> the face, but I can't see it. Perhaps I should just do things properly
> >>> with "extends EvalFunc<String>" and return the tuples separately
> >>> anyway...
> >>>
> >>> Thanks for any pointers,
> >>>
> >>> Dan
> >>>
> >>>
> >>> package tv.notube;
> >>> import com.twitter.Extractor;
> >>> import java.util.List;
> >>> class TwitterExtractor {
> >>>
> >>>   public static void main (String[] args) {
> >>>     String in = args[0];
> >>>     System.out.println("URLs: " + urls(in));
> >>>   }
> >>>
> >>>   public static String urls(String tweet) {
> >>>     Extractor ex = new Extractor();
> >>>     List urls = ex.extractURLs(tweet);
> >>>     String o = urls.toString();
> >>>     return o;
> >>>   }
> >>> }
> >>
> >>
> >
>
