Hi Drew,
  Thanks for the tips - much appreciated. See inline.

On 01/23/2011 09:22 AM, Drew Farris wrote:
Hi Darren,

From the error message you received, it is not exactly clear what is
happening here. I suppose it could be due to the format of the input
sequence file, but I'm not certain.

A couple of questions that will help me answer yours:

1) What version of Mahout are you using?
0.4
2) How are you generating the sequence file you are using as input to
the CollocDriver?
bin/mahout seqdirectory --charset ascii --input textfiles/ --output out

Then I run:

bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer

I am not running Hadoop. The error is repeatable. Here is the full output:
-----------
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 317 ms
[darren@cobalt mahout-distribution-0.4]$ bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props found on classpath, will use command-line arguments only
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer, --endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, --maxRed=2, --minLLR=1.0, --minSupport=2, --output=phrases, --startPhase=0, --tempDir=temp}
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Maximum n-gram size is: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum Support value: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum LLR value: 1.0
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Number of pass1 reduce tasks: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input will NOT be preprocessed
Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Max Ngram size is 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Emit Unitgrams is false
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.common.StringTuple
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0002
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0002
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 3064 ms
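-----------

In case it helps narrow things down, here is a minimal sketch (plain Hadoop
SequenceFile API; the path is the same chunk file passed to CollocDriver
above) for checking which key/value classes seqdirectory actually wrote.
The ClassCastException above suggests the mapper wants StringTuple values
rather than Text, so I'm guessing the mismatch is there:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class CheckSeqDirOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Same chunk file that was passed to CollocDriver above.
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("out/chunk-0"), conf);
    try {
      // Assumption based on the stack trace: CollocMapper expects
      // StringTuple values, while seqdirectory writes plain Text.
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
    } finally {
      reader.close();
    }
  }
}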

Using the latest code from trunk, I was able to run the following
sequence of commands on the data available after running
./examples/bin/build-reuters.sh

(All run from the mahout toplevel directory)

./bin/mahout seqdirectory \
   -i ./examples/bin/work/reuters-out/ \
   -o ./examples/bin/work/reuters-out-seqdir \
   -c UTF-8 -chunk 5

./bin/mahout seq2sparse \
   -i ./examples/bin/work/reuters-out-seqdir/ \
   -o ./examples/bin/work/reuters-out-seqdir-sparse

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
   -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
   -o ./examples/bin/work/reuters-colloc \
   -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
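
The important difference from your failing run is the input: the
tokenized-documents directory written by seq2sparse holds <Text, StringTuple>
pairs (document id -> token list), which is what the cast failure in
CollocMapper implies it expects, whereas seqdirectory writes plain Text
values. Purely to illustrate that format, here is a rough sketch that writes
one such pair by hand; the output path and the whitespace split are
placeholders, seq2sparse uses a real Lucene analyzer for tokenization:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class WriteTokenizedDoc {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder path; seq2sparse writes under <output>/tokenized-documents.
    Path out = new Path("tokenized-by-hand/part-00000");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, StringTuple.class);
    try {
      // Crude whitespace tokenization stands in for the analyzer that
      // seq2sparse would normally apply.
      StringTuple tokens = new StringTuple();
      for (String token : "the quick brown fox jumps over the lazy dog".split("\\s+")) {
        tokens.add(token);
      }
      writer.append(new Text("doc1"), tokens);
    } finally {
      writer.close();
    }
  }
}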

./bin/mahout seqdumper -s \
   ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less

This produces output like:

Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable
Key: 0 0 25: Value: 18.436118042416638
Key: 0 0 zen: Value: 39.36827993847055

Here the key is the trigram and the value is the LLR score.

If there are multiple parts in
examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
them, e.g.:

./bin/mahout seqdumper -s \
   ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
./bin/mahout seqdumper -s \
   ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out

Running the results through 'sort -rn -k 6,6' (a reverse numeric sort on
the sixth field, the LLR value) will give you output sorted by LLR score,
descending.
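
Alternatively, if you'd rather skip the concatenate-and-sort step, something
along these lines should work: read every part file under ngrams with the
plain SequenceFile API and print the n-grams ordered by score. Treat it as an
untested sketch; only the directory layout and the Text/DoubleWritable
classes come from the seqdumper output above.

import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpNgramsByLlr {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Highest LLR first; ties overwrite each other, which is fine for a sketch.
    Map<Double, String> byScore =
        new TreeMap<Double, String>(Collections.<Double>reverseOrder());

    // Read every reducer output part under the ngrams directory.
    Path glob = new Path("examples/bin/work/reuters-colloc/ngrams/part-r-*");
    FileStatus[] parts = fs.globStatus(glob);
    if (parts != null) {
      for (FileStatus status : parts) {
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, status.getPath(), conf);
        try {
          Text ngram = new Text();
          DoubleWritable llr = new DoubleWritable();
          while (reader.next(ngram, llr)) {
            byScore.put(llr.get(), ngram.toString());
          }
        } finally {
          reader.close();
        }
      }
    }

    for (Map.Entry<Double, String> e : byScore.entrySet()) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
  }
}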

HTH,

Drew

On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <[email protected]> wrote:
Hi,
  I'm new to Mahout and have tried to research this a bit, but I've run into
the following problem.

After I generate a sequence file for a directory of text files, I run this:

  bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
    -i out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3

It produces a couple of exceptions:
...
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.common.StringTuple
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
...
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)

How can I make this work?

Thanks for any tips,
Darren

