Drew,
  Thanks for the tip. It works great now!

Darren

PS. The sort command you suggested doesn't quite sort by LLR score:
it's only a lexical sort, so something like 70.000 ends up below 8.000
even though it should be greater. Adding -n for a numeric sort
(sort -rn -k 6,6) fixes it.
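To illustrate, here is the difference on a couple of made-up lines in
seqdumper's "Key: ... Value: ..." output format (the ngrams and scores
below are invented for illustration):

```shell
# Two made-up score lines; the LLR value is field 6.
printf 'Key: 0 0 25: Value: 8.000\nKey: 0 0 zen: Value: 70.000\n' > scores.txt

# Lexical reverse sort: "8.000" sorts ahead of "70.000" because '8' > '7'.
sort -r -k 6,6 scores.txt

# Numeric reverse sort (-n): 70.000 correctly comes out on top.
sort -rn -k 6,6 scores.txt
```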


On 01/23/2011 11:59 AM, Drew Farris wrote:
Ahh, ok. Output from seqdirectory is a SequenceFile<Text,Text>, where
the value is the un-tokenized text of each document. By default the
CollocDriver expects tokenized text as input, but if you add the '-p'
option to the CollocDriver command-line it will tokenize the text
before generating the collocations, so you can use the output of
seqdirectory as is.

for example:

./bin/mahout seqdirectory \
  -i ./examples/bin/work/reuters-out/ \
  -o ./examples/bin/work/reuters-out-seqdir \
  -c UTF-8 -chunk 5

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
   -i ./examples/bin/work/reuters-out-seqdir \
   -o ./examples/bin/work/reuters-colloc-2 \
   -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p

Drew

On Sun, Jan 23, 2011 at 10:44 AM, Darren Govoni <[email protected]> wrote:
Hi Drew,
  Thanks for the tips - much appreciated. See inline.

On 01/23/2011 09:22 AM, Drew Farris wrote:
Hi Darren,

  From the error message you receive, it is not exactly clear what is
happening here. I suppose it could be due to the format of the input
sequence file, but I'm not certain.

A couple questions that will help me answer your question:

1) What version of Mahout are you using?
0.4
2) How are you generating the sequence file you are using as input to
the CollocDriver?
bin/mahout seqdirectory --charset ascii --input textfiles/ --output out

Then I run:

bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer

I am not running hadoop. The error is repeatable. Here is the full output.
-----------
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 317 ms
[darren@cobalt mahout-distribution-0.4]$ bin/mahout
org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o
phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
no HADOOP_HOME set, running locally
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props
found on classpath, will use command-line arguments only
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments:
{--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer,
--endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, --maxRed=2,
--minLLR=1.0, --minSupport=2, --output=phrases, --startPhase=0,
--tempDir=temp}
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Maximum n-gram size is: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum Support value: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Minimum LLR value: 1.0
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Number of pass1 reduce tasks: 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Input will NOT be preprocessed
Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Jan 23, 2011 10:42:56 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Running job: job_local_0001
Jan 23, 2011 10:42:56 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
<init>
INFO: io.sort.mb = 100
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
<init>
INFO: data buffer = 79691776/99614720
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
<init>
INFO: record buffer = 262144/327680
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Max Ngram size is 2
Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Emit Unitgrams is false
Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.mahout.common.StringTuple
    at
org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO:  map 0% reduce 0%
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Job complete: job_local_0001
Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId=
- already initialized
Jan 23, 2011 10:42:57 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Running job: job_local_0002
Jan 23, 2011 10:42:58 AM
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 0
Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO:  map 0% reduce 0%
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
INFO: Job complete: job_local_0002
Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 3064 ms

Using the latest code from trunk, I was able to run the following
sequence of commands on the data available after running
./examples/bin/build-reuters.sh

(All run from the mahout toplevel directory)

./bin/mahout seqdirectory \
   -i ./examples/bin/work/reuters-out/ \
   -o ./examples/bin/work/reuters-out-seqdir \
   -c UTF-8 -chunk 5

./bin/mahout seq2sparse \
   -i ./examples/bin/work/reuters-out-seqdir/ \
   -o ./examples/bin/work/reuters-out-seqdir-sparse

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
   -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
   -o ./examples/bin/work/reuters-colloc \
   -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3

./bin/mahout seqdumper -s \
   ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less

This produces output like:

Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.DoubleWritable
Key: 0 0 25: Value: 18.436118042416638
Key: 0 0 zen: Value: 39.36827993847055

Where the key is the trigram and the value is the LLR score.

If there are multiple parts in
examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
them, e.g.:

./bin/mahout seqdumper -s \
./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
./bin/mahout seqdumper -s \
./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out

Running the results through 'sort -rm -k 6,6' will give you output
sorted by LLR score descending.

HTH,

Drew

On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <[email protected]> wrote:
Hi,
  I'm new to Mahout and tried to research this a bit before encountering
this problem.

After I generate a sequence file for a directory of text files, I run this:

  bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i \
out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3

It produces a couple of exceptions:
...
WARNING: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.mahout.common.StringTuple
    at

org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
...
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)

How can I make this work?

Thanks for any tips,
Darren


