I'm trying to use InvokeForString to call a simple static method that
wraps twitter-text-java (https://github.com/twitter/twitter-text-java,
API docs at http://mzsanford.github.com/twitter-text-java/docs/api/index.html),
specifically the extractURLs method of the Extractor class. Since the
logical result is a list of URLs, perhaps I should be writing a proper
Pig-centric wrapper that returns a tuple, but for now I thought a
stringified list would be OK for my immediate purposes: pulling out
all the URLs from a corpus of tweets, so we can expand the bit.ly and
other short URLs.

So I built the extra class (source below), packaged it inside the
twitter-text jar, and verified that it's in there and usable as follows:

danbri$ java -cp twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar \
    tv.notube.TwitterExtractor "hello http://example.com/ http://example.org/ world"
URLs: [http://example.com/, http://example.org/]

Then, from the same directory, I try to run this as a Pig job:

tw06 = load '/user/danbri/twitter/tweets2009-06.tab.txt.lzo' AS (
when: chararray, who: chararray, msg: chararray);
REGISTER twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar;
DEFINE ExtractURLs InvokeForString('tv.notube.TwitterExtractor.urls', 'String');
urls = FOREACH tw06 GENERATE ExtractURLs(msg);
x = SAMPLE urls 0.001;
dump x;

...but we don't get past InvokeForString:

2011-03-01 14:50:31,033 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
Details at logfile: /home/danbri/twitter/pig_1298987430385.log
...->
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.lang.ClassNotFoundException: tv.notube.TwitterExtractor
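
As I read the trace, the class lookup fails before the method lookup is
ever attempted. Below is a minimal JDK-only sketch of the kind of
reflective lookup a dynamic invoker has to perform. It is my assumption
of the general shape, not Pig's actual source; the class name
ReflectiveLookupSketch and its invokeForString helper are hypothetical.
It shows that the outcome depends on which ClassLoader is asked for the
class.

```java
import java.lang.reflect.Method;

public class ReflectiveLookupSketch {

    // Splits "pkg.Class.method" into a class name and a method name, loads the
    // class through the given loader, and calls a public static method that
    // takes one String and returns a String.
    public static String invokeForString(String qualifiedMethod, String arg,
                                         ClassLoader loader) throws Exception {
        int dot = qualifiedMethod.lastIndexOf('.');
        String className = qualifiedMethod.substring(0, dot);
        String methodName = qualifiedMethod.substring(dot + 1);

        // Throws ClassNotFoundException if 'loader' cannot see the class,
        // even when a jar containing it exists somewhere on disk.
        Class<?> cls = Class.forName(className, true, loader);

        Method m = cls.getMethod(methodName, String.class);
        return (String) m.invoke(null, arg);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader loader = ReflectiveLookupSketch.class.getClassLoader();

        // A JDK class is visible to any loader.
        System.out.println(
            invokeForString("java.util.regex.Pattern.quote", "a.b", loader));
        // prints \Qa.b\E

        // A class that is not on this loader's classpath fails at Class.forName.
        try {
            invokeForString("tv.notube.TwitterExtractor.urls", "hello", loader);
        } catch (ClassNotFoundException e) {
            System.out.println("class not visible to this loader: " + e.getMessage());
        }
    }
}
```

If the real code works along these lines, the question becomes which
class loader is in use when the DEFINE line is parsed, and whether the
registered jar is visible to it at that point.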

I checked that Pig is finding the jar by misspelling the filename in
the REGISTER line (which, as expected, causes things to fail earlier).
I also double-checked that the class is in the jar:

danbri$ jar -tvf twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar | grep tv
     0 Tue Mar 01 12:03:04 CET 2011 tv/
     0 Tue Mar 01 12:03:04 CET 2011 tv/notube/
  1114 Tue Mar 01 13:40:30 CET 2011 tv/notube/TwitterExtractor.class

...so I'm finding myself stuck. I'm sure the answer is staring me in
the face, but I can't see it. Perhaps I should just do things properly
with "extends EvalFunc<String>" and return the tuples separately
anyway...

Thanks for any pointers,

Dan


package tv.notube;

import com.twitter.Extractor;
import java.util.List;

class TwitterExtractor {

  public static void main(String[] args) {
    String in = args[0];
    System.out.println("URLs: " + urls(in));
  }

  public static String urls(String tweet) {
    Extractor ex = new Extractor();
    List<String> urls = ex.extractURLs(tweet);
    return urls.toString();
  }
}
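
The urls method above returns List.toString(), which wraps the URLs in
brackets and separates them with ", ". If the string form stays, a
delimiter-joined value may be easier to split apart later. A small
JDK-only sketch follows (String.join needs Java 8 or later). The class
name UrlJoinSketch and the choice of a single space as delimiter are my
assumptions, not part of twitter-text.

```java
import java.util.Arrays;
import java.util.List;

public class UrlJoinSketch {

    // Joins already-extracted URLs with a single space. A valid URL cannot
    // contain a literal space, so the result splits back unambiguously.
    public static String joinUrls(List<String> urls) {
        return String.join(" ", urls);
    }

    public static void main(String[] args) {
        List<String> urls =
            Arrays.asList("http://example.com/", "http://example.org/");

        // What urls() currently returns:
        System.out.println(urls.toString());
        // prints [http://example.com/, http://example.org/]

        // Delimiter-joined alternative:
        System.out.println(joinUrls(urls));
        // prints http://example.com/ http://example.org/
    }
}
```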
