OK, that did work for Mahout, thanks! But now Hadoop cannot load the
class, even though the jar containing it has been added to the Hadoop
classpath:
hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ echo $HADOOP_CLASSPATH
/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-core-3.0.2.jar:/home/camilo/mahout-distribution-0.4/utils/target/dependency/lucene-analyzers-3.0.2.jar:/home/hadoop/my_analyzer.jar
I get:
hadoop@ubuntu:/home/camilo/mahout-distribution-0.4$ bin/mahout seq2sparse -i
/htmless_articles_seq -o /htmless_articles_vectors_2 -wt tfidf -a
com.my.analyzers.MyAnalyzer
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/04/21 13:39:33 WARN driver.MahoutDriver: No seq2sparse.props found on
classpath, will use command-line arguments only
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
n-gram size is: 3
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR
value: 1.0
11/04/21 13:39:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of
reduce tasks: 1
11/04/21 13:39:33 INFO common.HadoopUtil: Deleting /htmless_articles_vectors_2
11/04/21 13:39:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing
the arguments. Applications should implement Tool for the same.
11/04/21 13:39:33 INFO input.FileInputFormat: Total input paths to process : 1
11/04/21 13:39:33 INFO mapred.JobClient: Running job: job_201104211109_0038
11/04/21 13:39:34 INFO mapred.JobClient: map 0% reduce 0%
11/04/21 13:39:43 INFO mapred.JobClient: Task Id :
attempt_201104211109_0038_m_000000_0, Status : FAILED
java.lang.IllegalStateException: java.lang.ClassNotFoundException:
com.my.analyzers.MyAnalyzer
at
org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:61)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: com.my.analyzers.MyAnalyzer
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at
org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:57)
... 4 more
Is there anything I'm missing there?
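For context, HADOOP_CLASSPATH only extends the classpath of the client JVM that bin/hadoop or bin/mahout starts; the mapper runs in a separate task JVM on a worker node that never sees that variable, so the jar usually has to be shipped with the job (e.g. via -libjars, if the driver honors GenericOptionsParser) or copied into $HADOOP_HOME/lib on every node. A self-contained sketch of the by-name lookup that fails inside the task JVM (the class name is the one from the trace above; whether -libjars works with seq2sparse in 0.4 is an assumption to verify):

```java
public class TaskClasspathDemo {
    // Try to load a class by name, the way Mahout's mapper does in
    // SequenceFileTokenizerMapper.setup(); return a status string.
    static String tryLoad(String className) {
        try {
            Class.forName(className);
            return "found";
        } catch (ClassNotFoundException e) {
            return "not found: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // java.lang.String is on every JVM's classpath...
        System.out.println(tryLoad("java.lang.String"));
        // ...but the custom analyzer is only in a jar this JVM never
        // received, which mirrors the failure in the stack trace above.
        System.out.println(tryLoad("com.my.analyzers.MyAnalyzer"));
    }
}
```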
On 2011-04-20, at 1:32 PM, Ian Helmke wrote:
> Yes, if you make a subclass of StandardAnalyzer or your own Analyzer
> that has a constructor with no arguments (presumably which calls a
> superclass constructor with the arguments you want), that should work
> nicely. (You could also just add a zero-argument constructor to your
> own custom analyzer.)
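A sketch of what such a subclass could look like against Lucene 3.0.x; the Version constant, and whether StandardAnalyzer is still subclassable in your exact Lucene version, are assumptions to verify:

```java
package com.my.analyzers;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer exposing the nullary constructor Mahout needs;
// the superclass constructor gets its arguments fixed here.
public class MyAnalyzer extends StandardAnalyzer {
    public MyAnalyzer() {
        super(Version.LUCENE_30);
    }
}
```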
>
> On Wed, Apr 20, 2011 at 1:25 PM, Camilo Lopez <[email protected]> wrote:
>> Ian,
>>
>> Using 3.0.x (the one that comes by default in Mahout's trunk now).
>> By nullary constructor, do you mean I should overload the constructor
>> to receive no args in my own custom class?
>>
>>
>> On 2011-04-20, at 1:23 PM, Ian Helmke wrote:
>>
>>> What version of lucene are you using? If you use lucene 3.0 or later,
>>> you can't use StandardAnalyzer as-is because it has no no-args
>>> constructor. You could try the mahout DefaultAnalyzer (which wraps the
>>> lucene analyzer in a no-argument constructor). I have gotten custom
>>> analyzers to work, but they need to have a nullary constructor.
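The nullary-constructor requirement comes from reflective instantiation: the analyzer is created by class name via newInstance(), which fails unless a public zero-argument constructor exists. A self-contained sketch (the stand-in class names here are made up):

```java
// Two stand-in classes: one instantiable reflectively, one not.
class WithNullary {
    public WithNullary() {}
}

class WithoutNullary {
    public WithoutNullary(String arg) {}
}

public class NullaryDemo {
    // Mimic how an analyzer is created from a Class object:
    // newInstance() needs a public zero-argument constructor.
    static String instantiate(Class<?> clazz) {
        try {
            clazz.newInstance();
            return "ok";
        } catch (InstantiationException | IllegalAccessException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate(WithNullary.class));
        System.out.println(instantiate(WithoutNullary.class));
    }
}
```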
>>>
>>>
>>> On Wed, Apr 20, 2011 at 12:58 PM, Camilo Lopez <[email protected]>
>>> wrote:
>>>> Hi List,
>>>>
>>>> Trying to run custom analyzer classes, I'm always getting
>>>> InstantiationException. At first I suspected my own code, but trying
>>>> with what is supposed to be the default value
>>>> 'org.apache.lucene.analysis.standard.StandardAnalyzer' I still get
>>>> the same exception.
>>>>
>>>> This is the command
>>>>
>>>> bin/mahout seq2sparse -i /htmless_articles_seq -o
>>>> /htmless_articles_vectors_1 -ng 3 -x35 -wt tfidf -a
>>>> org.apache.lucene.analysis.standard.StandardAnalyzer -nv
>>>>
>>>>
>>>> Looking a little deeper (i.e. catching the InstantiationException
>>>> and rethrowing getCause()), it turns out the problem is caused by a
>>>> NullPointerException:
>>>>
>>>> Exception in thread "main" java.lang.NullPointerException
>>>> at
>>>> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:211)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>> at
>>>> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:52)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> at
>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>> at
>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>
>>>>
>>>> Am I missing something? Is there another way to create/use custom
>>>> analyzers in seq2sparse?
>>>>
>>>>
>>>>
>>
>>