Re: Using dynamic invokers (InvokeForString)

Dmitriy Ryaboy Tue, 01 Mar 2011 11:08:14 -0800

argh.

try doing both -- add google-collections to the classpath, *and* register
it. Same with twitter-text.
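
If it helps narrow things down, here is a minimal sketch of a classpath probe. It is my own illustration and not part of Pig: the class name ClasspathProbe and its methods are hypothetical. Run it with the same classpath you hand to Pig and it reports which of the classes named in your stack traces that JVM can actually see. It only tells you about the JVM it runs in (the client side), not the task JVMs on the cluster, which is why registering the jars still matters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pig: reports whether classes are visible
// to the JVM it runs in.
final class ClasspathProbe {

    // True if the class loader that loaded this probe can find the named class.
    // The class is located but not initialized.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Probes each name in order and records the outcome.
    static Map<String, Boolean> probe(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isLoadable(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Defaults are the class names that appear in the errors in this thread.
        String[] names = args.length > 0 ? args : new String[] {
            "tv.notube.TwitterExtractor",
            "com.twitter.Extractor",
            "com.google.common.collect.Sets"
        };
        for (Map.Entry<String, Boolean> e : probe(names).entrySet()) {
            System.out.println((e.getValue() ? "found    " : "MISSING  ") + e.getKey());
        }
    }
}
```

Something like "java -cp .:$PIG_CLASSPATH ClasspathProbe" should then show whether google-collections is really being picked up on the client.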


D


On Tue, Mar 1, 2011 at 10:56 AM, Dan Brickley <[email protected]> wrote:

> On 1 March 2011 18:02, Dan Brickley <[email protected]> wrote:
> > On 1 March 2011 17:56, Dmitriy Ryaboy <[email protected]> wrote:
> >> Hi Dan,
> >> iirc, registering a jar does not put it on the Pig client classpath, it just
> >> tells Pig to ship the jar. You want to put it on the PIG_CLASSPATH before
> >> you invoke pig.
> >
> > Perfect, that was exactly it. It's running now :)
>
> That'll teach me to use words like "perfect". I get to run a job now,
> however ...
>
> 2011-03-01 19:25:54,167 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: FILTER
> 2011-03-01 19:25:54,167 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2011-03-01 19:25:54,235 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
> pruned for tw06: $0, $1
> 2011-03-01 19:25:54,241 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> (Name: urls: Store(hdfs://
> node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477:org.apache.pig.impl.io.InterStorage
> )
> - scope-27 Operator Key: scope-27)
> 2011-03-01 19:25:54,242 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> - File concatenation threshold: 100 optimistic? false
> 2011-03-01 19:25:54,243 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size before optimization: 1
> 2011-03-01 19:25:54,243 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size after optimization: 1
> 2011-03-01 19:25:54,247 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> added to the job
> 2011-03-01 19:25:54,247 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - mapred.job.reduce.markreset.buffer.percent is not set, set to
> default 0.3
> 2011-03-01 19:25:56,254 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Setting up single store job
> 2011-03-01 19:25:56,261 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map-reduce job(s) waiting for submission.
> 2011-03-01 19:25:56,508 [Thread-19] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 1
> 2011-03-01 19:25:56,509 [Thread-19] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths to process : 1
> 2011-03-01 19:25:56,516 [Thread-19] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths (combined) to process : 1
> 2011-03-01 19:25:56,763 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> 2011-03-01 19:25:57,395 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - HadoopJobId: job_201102272217_0017
> 2011-03-01 19:25:57,395 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - More information at:
>
> http://node1.hdfs-hadoop.sara.nl:50030/jobdetails.jsp?jobid=job_201102272217_0017
> 2011-03-01 19:26:52,239 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - job job_201102272217_0017 has failed! Stop running all dependent
> jobs
> 2011-03-01 19:26:52,241 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2011-03-01 19:26:52,252 [main] ERROR
> org.apache.pig.tools.pigstats.PigStats - ERROR 2997: Unable to
> recreate exception from backed error: java.io.IOException:
> Deserialization error: could not instantiate 'InvokeForString' with
> arguments '[tv.notube.TwitterExtractor.urls, String]'
> 2011-03-01 19:26:52,252 [main] ERROR
> org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s)
> failed!
> 2011-03-01 19:26:52,253 [main] INFO
> org.apache.pig.tools.pigstats.PigStats - Script Statistics:
>
> HadoopVersion   PigVersion      UserId  StartedAt               FinishedAt              Features
> 0.20.2-CDH3B4   0.8.0-CDH3B4    danbri  2011-03-01 19:25:54     2011-03-01 19:26:52     FILTER
>
> Failed!
>
> Failed Jobs:
> JobId   Alias   Feature Message Outputs
> job_201102272217_0017   tw06,urls,x     MAP_ONLY        Message: Job failed! Error - NA hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477,
>
> Input(s):
> Failed to read data from "/user/danbri/twitter/tweets2009-06.tab.txt.lzo"
>
> Output(s):
> Failed to produce result in
> "hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477"
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201102272217_0017
>
>
> 2011-03-01 19:26:52,253 [main] INFO
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Failed!
> 2011-03-01 19:26:52,298 [main] ERROR org.apache.pig.tools.grunt.Grunt
> - ERROR 2997: Unable to recreate exception from backed error:
> java.io.IOException: Deserialization error: could not instantiate
> 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
> String]'
> Details at logfile: /home/danbri/twitter/pig_1299003251377.log
>
> Looking there, it seems something (not afaik the twitter-text nor my
> library) wants some Google Collections class.
>
> /home/danbri/twitter/pig_1299003251377.log
> ->
> java.io.IOException: Deserialization error: could not instantiate
> 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
> String]'
> Caused by: java.lang.RuntimeException: could not instantiate
> 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
> String]'
> Caused by: java.lang.reflect.InvocationTargetException
> Caused by: java.lang.NoClassDefFoundError: com/google/common/collect/Sets
> Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Sets
>
> Perhaps this is something wrong in our Pig setup. I'll get a .jar from
> http://code.google.com/p/google-collections/source/checkout and see if
> that'll fix it. [...] ... nope, added ./google-collect-snapshot.jar to
> PIG_CLASSPATH, but I'm getting the same behaviour.
>
> Investigating...
>
> Dan
>
> >> On Tue, Mar 1, 2011 at 5:57 AM, Dan Brickley <[email protected]> wrote:
> >>>
> >>> I'm trying to use InvokeForString to call a simple static method that
> >>> wraps http://mzsanford.github.com/twitter-text-java/docs/api/index.html
> >>> https://github.com/twitter/twitter-text-java ... specifically the
> >>> Extractor class extractURLs method.  In fact since the logical result
> >>> is a list of URLs perhaps I should be writing proper Pig-centric
> >>> wrapper that returns a tuple, but for now I thought a stringified list
> >>> would be ok for my immediate purposes. That purpose being pulling out
> >>> all the URLs from a corpus of tweets, so we can expand the bit.ly and
> >>> other short urls...
> >>>
> >>> So - I built the extra class (src below) and packaged it inside the
> >>> twitter-text jar, and verify it's in there and usable as follows:
> >>>
> >>> danbri$ java -cp
> >>> twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar
> >>> tv.notube.TwitterExtractor "hello http://example.com/
> >>> http://example.org/ world"
> >>> URLs: [http://example.com/, http://example.org/]
> >>>
> >>> Then from the same directory, I try to run this as a Pig job:
> >>>
> >>> tw06 = load '/user/danbri/twitter/tweets2009-06.tab.txt.lzo' AS (
> >>> when: chararray, who: chararray, msg: chararray);
> >>> REGISTER twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar;
> >>> DEFINE ExtractURLs InvokeForString('tv.notube.TwitterExtractor.urls',
> >>> 'String');
> >>> urls = FOREACH tw06 GENERATE ExtractURLs(msg);
> >>> x = SAMPLE urls 0.001;
> >>> dump x;
> >>>
> >>> ...but we don't get past InvokeForString,
> >>>
> >>> 2011-03-01 14:50:31,033 [main] ERROR org.apache.pig.tools.grunt.Grunt
> >>> - ERROR 1000: Error during parsing. could not instantiate
> >>> 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
> >>> String]'
> >>> Details at logfile: /home/danbri/twitter/pig_1298987430385.log
> >>> ...->
> >>> Caused by: java.lang.reflect.InvocationTargetException
> >>> Caused by: java.lang.ClassNotFoundException: tv.notube.TwitterExtractor
> >>>
> >>> I checked that Pig is finding the jar by mis-spelling the filename in
> >>> the "REGISTER" line (which as expected causes things to fail earlier).
> >>> I also double-checked that the class is in the jar:
> >>> danbri$ jar -tvf
> >>> twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar | grep tv
> >>>     0 Tue Mar 01 12:03:04 CET 2011 tv/
> >>>     0 Tue Mar 01 12:03:04 CET 2011 tv/notube/
> >>>  1114 Tue Mar 01 13:40:30 CET 2011 tv/notube/TwitterExtractor.class
> >>>
> >>> ...so I'm finding myself stuck. I'm sure the answer is staring me in
> >>> the face, but I can't see it. Perhaps I should just do things properly
> >>> with "extends EvalFunc<String>" and return the tuples separately
> >>> anyway...
> >>>
> >>> Thanks for any pointers,
> >>>
> >>> Dan
> >>>
> >>>
> >>> package tv.notube;
> >>> import com.twitter.Extractor;
> >>> import java.util.List;
> >>> // public, so that reflective callers in other packages can invoke urls()
> >>> public class TwitterExtractor {
> >>>
> >>>  // Command-line check: prints the URLs found in the first argument.
> >>>  public static void main (String[] args) {
> >>>    String in = args[0];
> >>>    System.out.println("URLs: " + urls(in));
> >>>  }
> >>>
> >>>  // Returns the URLs found in the tweet as a stringified list.
> >>>  public static String urls(String tweet) {
> >>>    Extractor ex = new Extractor();
> >>>    List<String> urls = ex.extractURLs(tweet);
> >>>    return urls.toString();
> >>>  }
> >>> }
> >>
> >>
> >
>
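
For context on why the same "could not instantiate" message shows up both at parse time and again in the failed job: a dynamic invoker has to resolve the class by name in whichever JVM constructs it. The sketch below is only my approximation of that idea using plain Java reflection. It is not Pig's actual implementation, and StaticStringInvoker is a hypothetical name. The constructor fails with ClassNotFoundException whenever the named class is not on that JVM's classpath.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical illustration, not Pig's code: resolves "package.Class.method"
// to a static method taking one String, and calls it reflectively.
final class StaticStringInvoker {
    private final Method method;

    StaticStringInvoker(String fullName) throws ReflectiveOperationException {
        int dot = fullName.lastIndexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("expected package.Class.method, got: " + fullName);
        }
        String className = fullName.substring(0, dot);
        String methodName = fullName.substring(dot + 1);

        // Throws ClassNotFoundException if the class is not visible to this JVM.
        Class<?> owner = Class.forName(className);
        // Throws NoSuchMethodException if there is no public method(String).
        Method m = owner.getMethod(methodName, String.class);
        if (!Modifier.isStatic(m.getModifiers())) {
            throw new IllegalArgumentException(fullName + " is not static");
        }
        this.method = m;
    }

    // Invokes the static method and stringifies whatever it returns.
    String invoke(String arg) throws ReflectiveOperationException {
        return String.valueOf(method.invoke(null, arg));
    }
}
```

For example, new StaticStringInvoker("java.util.regex.Pattern.quote").invoke("abc") goes through the same lookup steps with a JDK method. If the lookup really does work this way, the client needs the jars on PIG_CLASSPATH to build the invoker, and the tasks need them shipped via REGISTER to rebuild it.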
