Hi Darren,

From the error message you receive, it is not exactly clear what is
happening here. I suppose it could be due to the format of the input
sequence file, but I'm not certain.

A couple of questions that will help me answer your question:

1) What version of Mahout are you using?
2) How are you generating the sequence file you are using as input to
the CollocDriver?

Using the latest code from trunk, I was able to run the following
sequence of commands on the data available after running
./examples/bin/build-reuters.sh

(All run from the mahout toplevel directory)

./bin/mahout seqdirectory \
  -i ./examples/bin/work/reuters-out/ \
  -o ./examples/bin/work/reuters-out-seqdir \
  -c UTF-8 -chunk 5

./bin/mahout seq2sparse \
  -i ./examples/bin/work/reuters-out-seqdir/ \
  -o ./examples/bin/work/reuters-out-seqdir-sparse

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
  -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
  -o ./examples/bin/work/reuters-colloc \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3

./bin/mahout seqdumper -s \
  ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less

This produces output like:

Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.DoubleWritable
Key: 0 0 25: Value: 18.436118042416638
Key: 0 0 zen: Value: 39.36827993847055

Where the key is the trigram and the value is its LLR (log-likelihood
ratio) score.
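(Not part of Mahout itself, just an illustrative sketch: if you want to
pull those key/value pairs into a script, you can parse the seqdumper
output lines shown above. The regex assumes exactly the "Key: ... :
Value: ..." format printed in my run; adjust it if your version prints
differently.)

```python
import re

# Matches seqdumper lines like:
#   Key: 0 0 zen: Value: 39.36827993847055
# capturing the n-gram and its LLR score.
LINE_RE = re.compile(r"^Key: (.*): Value: ([0-9.Ee+-]+)$")

def parse_seqdumper_lines(lines):
    """Return (ngram, llr_score) pairs from seqdumper output lines."""
    pairs = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs

sample = [
    "Key: 0 0 25: Value: 18.436118042416638",
    "Key: 0 0 zen: Value: 39.36827993847055",
]
print(parse_seqdumper_lines(sample))
```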

If there are multiple part files in
examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
them, e.g.:

./bin/mahout seqdumper -s \
  ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
./bin/mahout seqdumper -s \
  ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out

Running the results through 'sort -rn -k 6,6' will give you output
sorted by LLR score descending (field 6 is the score).
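(Again just a sketch, not Mahout code: the same descending sort on field
6 can be done in a script, assuming the seqdumper line format shown
above.)

```python
# Sort seqdumper output lines by the 6th whitespace-separated field
# (the LLR score), descending -- roughly what `sort -rn -k 6,6` does.
# For "Key: 0 0 zen: Value: 39.36827993847055", split() gives:
#   ["Key:", "0", "0", "zen:", "Value:", "39.36827993847055"]
# so index 5 is the score.
lines = [
    "Key: 0 0 25: Value: 18.436118042416638",
    "Key: 0 0 zen: Value: 39.36827993847055",
]
ranked = sorted(lines, key=lambda l: float(l.split()[5]), reverse=True)
for line in ranked:
    print(line)
```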

HTH,

Drew

On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <[email protected]> wrote:
> Hi,
>  I'm new to mahout and tried to research this a bit before encountering this
> problem.
>
> After I generate sequencefile for directory of text files, I run this:
>
>  bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
> out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>
> It produces a couple exceptions:
> ...
> WARNING: job_local_0001
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.common.StringTuple
>    at
> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> ...
> ava.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>    at java.util.ArrayList.get(ArrayList.java:322)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>
> How can I make this work?
>
> Thanks for any tips,
> Darren
>
