Hi Darren,

From the error message you received, it is not entirely clear what is happening here. I suspect it could be due to the format of the input sequence file, but I'm not certain.
A couple of questions that will help me answer yours:

1) What version of Mahout are you using?
2) How are you generating the sequence file you are using as input to the CollocDriver?

Using the latest code from trunk, I was able to run the following sequence of commands on the data available after running ./examples/bin/build-reuters.sh (all run from the Mahout top-level directory):

  ./bin/mahout seqdirectory \
    -i ./examples/bin/work/reuters-out/ \
    -o ./examples/bin/work/reuters-out-seqdir \
    -c UTF-8 -chunk 5

  ./bin/mahout seq2sparse \
    -i ./examples/bin/work/reuters-out-seqdir/ \
    -o ./examples/bin/work/reuters-out-seqdir-sparse

  ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
    -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
    -o ./examples/bin/work/reuters-colloc \
    -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3

  ./bin/mahout seqdumper -s ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less

Note that CollocDriver here reads the tokenized-documents directory produced by seq2sparse, whose values are StringTuples, rather than the raw Text values that seqdirectory writes; that difference matches the ClassCastException you're seeing.

This produces output like:

  Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
  Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable
  Key: 0 0 25: Value: 18.436118042416638
  Key: 0 0 zen: Value: 39.36827993847055

where the key is the trigram and the value is the LLR score.

If there are multiple part files in examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate them, e.g.:

  ./bin/mahout seqdumper -s ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
  ./bin/mahout seqdumper -s ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out

Running the results through 'sort -rn -k 6,6' will give you output sorted by LLR score, descending.

HTH,
Drew

On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <[email protected]> wrote:
> Hi,
> I'm new to mahout and tried to research this a bit before encountering this
> problem.
>
> After I generate a sequence file for a directory of text files, I run this:
>
> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
> out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>
> It produces a couple of exceptions:
> ...
> WARNING: job_local_0001
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.common.StringTuple
>         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> ...
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>
> How can I make this work?
>
> Thanks for any tips,
> Darren
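
P.S. To illustrate the sort step without needing a Mahout run: in seqdumper's "Key: ... Value: ..." lines, the LLR score is the sixth whitespace-separated field, so a reverse numeric sort on that field orders the ngrams from highest to lowest score. A minimal sketch using the two sample lines from the output above (the /tmp path is just for illustration):

```shell
# Two sample lines in the format seqdumper emits for trigrams.
cat > /tmp/ngrams.txt <<'EOF'
Key: 0 0 25: Value: 18.436118042416638
Key: 0 0 zen: Value: 39.36827993847055
EOF

# Field 6 is the LLR value; -n compares it numerically, -r reverses
# the order, so the highest-scoring ngram comes first.
sort -rn -k 6,6 /tmp/ngrams.txt
```

The same sort command works unchanged on the concatenated 'out' file built from multiple part files.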
