On 1 March 2011 18:02, Dan Brickley <[email protected]> wrote:
> On 1 March 2011 17:56, Dmitriy Ryaboy <[email protected]> wrote:
>> Hi Dan,
>> iirc, registering a jar does not put it on the Pig client classpath, it just
>> tells Pig to ship the jar. You want to put it on the PIG_CLASSPATH before
>> you invoke pig.
>
> Perfect, that was exactly it. It's running now :)
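
(Noting the fix for the archives — a minimal sketch, assuming the jar sits in the working directory:)

```shell
# Put the UDF jar on the Pig *client* classpath before invoking pig.
# REGISTER only ships the jar to the cluster; it does not make the
# class visible to the Pig client that parses the script.
export PIG_CLASSPATH=./twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar:$PIG_CLASSPATH
echo "$PIG_CLASSPATH"
```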

That'll teach me to use words like "perfect". I get to run a job now,
however ...

2011-03-01 19:25:54,167 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: FILTER
2011-03-01 19:25:54,167 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
pig.usenewlogicalplan is set to true. New logical plan will be used.
2011-03-01 19:25:54,235 [main] INFO
org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns
pruned for tw06: $0, $1
2011-03-01 19:25:54,241 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
(Name: urls: 
Store(hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477:org.apache.pig.impl.io.InterStorage)
- scope-27 Operator Key: scope-27)
2011-03-01 19:25:54,242 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
- File concatenation threshold: 100 optimistic? false
2011-03-01 19:25:54,243 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2011-03-01 19:25:54,243 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2011-03-01 19:25:54,247 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
added to the job
2011-03-01 19:25:54,247 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to
default 0.3
2011-03-01 19:25:56,254 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2011-03-01 19:25:56,261 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission.
2011-03-01 19:25:56,508 [Thread-19] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 1
2011-03-01 19:25:56,509 [Thread-19] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths to process : 1
2011-03-01 19:25:56,516 [Thread-19] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 1
2011-03-01 19:25:56,763 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
2011-03-01 19:25:57,395 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_201102272217_0017
2011-03-01 19:25:57,395 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- More information at:
http://node1.hdfs-hadoop.sara.nl:50030/jobdetails.jsp?jobid=job_201102272217_0017
2011-03-01 19:26:52,239 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- job job_201102272217_0017 has failed! Stop running all dependent
jobs
2011-03-01 19:26:52,241 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
2011-03-01 19:26:52,252 [main] ERROR
org.apache.pig.tools.pigstats.PigStats - ERROR 2997: Unable to
recreate exception from backed error: java.io.IOException:
Deserialization error: could not instantiate 'InvokeForString' with
arguments '[tv.notube.TwitterExtractor.urls, String]'
2011-03-01 19:26:52,252 [main] ERROR
org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s)
failed!
2011-03-01 19:26:52,253 [main] INFO
org.apache.pig.tools.pigstats.PigStats - Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2-CDH3B4   0.8.0-CDH3B4    danbri  2011-03-01 19:25:54  2011-03-01 19:26:52  FILTER

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201102272217_0017   tw06,urls,x     MAP_ONLY        Message: Job failed! Error - NA    hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477,

Input(s):
Failed to read data from "/user/danbri/twitter/tweets2009-06.tab.txt.lzo"

Output(s):
Failed to produce result in
"hdfs://node1.hdfs-hadoop.sara.nl/tmp/temp-904547330/tmp1348346477"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201102272217_0017


2011-03-01 19:26:52,253 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Failed!
2011-03-01 19:26:52,298 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 2997: Unable to recreate exception from backed error:
java.io.IOException: Deserialization error: could not instantiate
'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
String]'
Details at logfile: /home/danbri/twitter/pig_1299003251377.log

Looking there, it seems something (not, AFAIK, twitter-text nor my
library) wants a Google Collections class.

/home/danbri/twitter/pig_1299003251377.log
->
java.io.IOException: Deserialization error: could not instantiate
'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
String]'
Caused by: java.lang.RuntimeException: could not instantiate
'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
String]'
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.lang.NoClassDefFoundError: com/google/common/collect/Sets
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Sets

Perhaps this is something wrong in our Pig setup. I'll get a .jar from
http://code.google.com/p/google-collections/source/checkout and see if
that'll fix it. [...] ... nope, I added ./google-collect-snapshot.jar to
PIG_CLASSPATH, but I'm getting the same behaviour.
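
One thing I haven't tried yet: the NoClassDefFoundError surfaces as a
backend "Deserialization error", so maybe the Google Collections jar
needs to be REGISTERed (so it ships to the task nodes) rather than only
added to the client classpath. Something like this, with the jar name
taken from the snapshot download:

```pig
REGISTER google-collect-snapshot.jar;
REGISTER twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar;
```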

Investigating...

Dan

>> On Tue, Mar 1, 2011 at 5:57 AM, Dan Brickley <[email protected]> wrote:
>>>
>>> I'm trying to use InvokeForString to call a simple static method that
>>> wraps http://mzsanford.github.com/twitter-text-java/docs/api/index.html
>>> https://github.com/twitter/twitter-text-java ... specifically the
>>> Extractor class's extractURLs method. In fact, since the logical result
>>> is a list of URLs, perhaps I should be writing a proper Pig-centric
>>> wrapper that returns a tuple, but for now I thought a stringified list
>>> would be OK for my immediate purposes: pulling out all the URLs from a
>>> corpus of tweets, so we can expand the bit.ly and other short URLs...
>>>
>>> So - I built the extra class (src below) and packaged it inside the
>>> twitter-text jar, and verified it's in there and usable as follows:
>>>
>>> danbri$ java -cp
>>> twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar
>>> tv.notube.TwitterExtractor "hello http://example.com/
>>> http://example.org/ world"
>>> URLs: [http://example.com/, http://example.org/]
>>>
>>> Then, from the same directory, I try to run this as a Pig job:
>>>
>>> tw06 = load '/user/danbri/twitter/tweets2009-06.tab.txt.lzo' AS (
>>> when: chararray, who: chararray, msg: chararray);
>>> REGISTER twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar;
>>> DEFINE ExtractURLs InvokeForString('tv.notube.TwitterExtractor.urls',
>>> 'String');
>>> urls = FOREACH tw06 GENERATE ExtractURLs(msg);
>>> x = SAMPLE urls 0.001;
>>> dump x;
>>>
>>> ...but we don't get past InvokeForString:
>>>
>>> 2011-03-01 14:50:31,033 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>> - ERROR 1000: Error during parsing. could not instantiate
>>> 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls,
>>> String]'
>>> Details at logfile: /home/danbri/twitter/pig_1298987430385.log
>>> ...->
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> Caused by: java.lang.ClassNotFoundException: tv.notube.TwitterExtractor
>>>
>>> I checked that Pig is finding the jar by mis-spelling the filename in
>>> the "REGISTER" line (which, as expected, causes things to fail earlier).
>>> I also double-checked that the class is in the jar:
>>> danbri$ jar -tvf
>>> twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar | grep tv
>>>     0 Tue Mar 01 12:03:04 CET 2011 tv/
>>>     0 Tue Mar 01 12:03:04 CET 2011 tv/notube/
>>>  1114 Tue Mar 01 13:40:30 CET 2011 tv/notube/TwitterExtractor.class
>>>
>>> ...so I'm finding myself stuck. I'm sure the answer is staring me in
>>> the face, but I can't see it. Perhaps I should just do things properly
>>> with "extends EvalFunc<String>" and return the tuples separately
>>> anyway...
>>>
>>> Thanks for any pointers,
>>>
>>> Dan
>>>
>>>
>>> package tv.notube;
>>> import com.twitter.Extractor;
>>> import java.util.List;
>>> // Thin static wrapper around twitter-text's Extractor so that Pig's
>>> // InvokeForString can call it reflectively; the class must be public
>>> // for reflective access from another package.
>>> public class TwitterExtractor {
>>>
>>>  public static void main (String[] args) {
>>>    String in = args[0];
>>>    System.out.println("URLs: " + urls(in));
>>>  }
>>>
>>>  // Returns the URLs found in the tweet as a stringified list,
>>>  // e.g. "[http://example.com/, http://example.org/]".
>>>  public static String urls(String tweet) {
>>>    Extractor ex = new Extractor();
>>>    List<String> urls = ex.extractURLs(tweet);
>>>    return urls.toString();
>>>  }
>>> }
>>
>>
>
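
For completeness, the "do it properly" EvalFunc route mentioned above
would look roughly like this — an untested sketch, with the class name
ExtractUrls invented here (it would live in the same registered jar):

```java
package tv.notube;

import java.io.IOException;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import com.twitter.Extractor;

// Sketch of a Pig UDF wrapping twitter-text's Extractor. Returning a
// chararray keeps it simple; a tuple/bag of URLs would be more Pig-like.
public class ExtractUrls extends EvalFunc<String> {

    private final Extractor extractor = new Extractor();

    @Override
    public String exec(Tuple input) throws IOException {
        // Be defensive about null/empty rows from the loader.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String tweet = (String) input.get(0);
        List<String> urls = extractor.extractURLs(tweet);
        return urls.toString();
    }
}
```

It would then be used without the InvokeForString indirection:
DEFINE ExtractURLs tv.notube.ExtractUrls();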
