Ahh, ok. Output from seqdirectory is a SequenceFile<Text,Text>, where the value is the un-tokenized text of each document. By default the CollocDriver expects tokenized text as input, but if you add the '-p' option to the CollocDriver command-line it will tokenize the text before generating the collocations, so you can use the output of seqdirectory as is.
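In case it helps to see what that preprocessing pass amounts to conceptually: here is a rough Python sketch, purely illustrative (Mahout's actual tokenization happens in Java, via whatever Lucene Analyzer you pass with '-a'; the function name and regex below are invented for the example):

```python
import re

def tokenize(document):
    """Stand-in for an analyzer pass: lowercased word tokens."""
    return re.findall(r"[a-z0-9]+", document.lower())

# seqdirectory's SequenceFile value: one whole, un-tokenized document.
raw_value = "Mahout generates collocations from text."

# With -p, the CollocDriver runs roughly this step first, turning the raw
# Text value into the sequence of tokens (a StringTuple) its mapper expects.
print(tokenize(raw_value))
```

Without that step, the mapper receives raw Text where it expects a token sequence, which is exactly the ClassCastException below.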
for example:

./bin/mahout seqdirectory \
  -i ./examples/bin/work/reuters-out/ \
  -o ./examples/bin/work/reuters-out-seqdir \
  -c UTF-8 -chunk 5

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
  -i ./examples/bin/work/reuters-out-seqdir \
  -o ./examples/bin/work/reuters-colloc-2 \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p

Drew

On Sun, Jan 23, 2011 at 10:44 AM, Darren Govoni <[email protected]> wrote:
> Hi Drew,
> Thanks for the tips - much appreciated. See inline.
>
> On 01/23/2011 09:22 AM, Drew Farris wrote:
>>
>> Hi Darren,
>>
>> From the error message you receive, it is not exactly clear what is
>> happening here. I suppose it could be due to the format of the input
>> sequence file, but I'm not certain.
>>
>> A couple questions that will help me answer your question:
>>
>> 1) What version of Mahout are you using?
>
> 0.4
>>
>> 2) How are you generating the sequence file you are using as input to
>> the CollocDriver?
>
> bin/mahout seqdirectory --charset ascii --input textfiles/ --output out
>
> Then I run:
>
> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>
> I am not running hadoop. The error is repeatable. Here is the full output.
> -----------
> no HADOOP_HOME set, running locally
> Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Program took 317 ms
> [darren@cobalt mahout-distribution-0.4]$ bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
> no HADOOP_HOME set, running locally
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
> WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props found on classpath, will use command-line arguments only
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Command line arguments: {--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer, --endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, --maxRed=2, --minLLR=1.0, --minSupport=2, --output=phrases, --startPhase=0, --tempDir=temp}
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Maximum n-gram size is: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Minimum Support value: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Minimum LLR value: 1.0
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Number of pass1 reduce tasks: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Input will NOT be preprocessed
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 1
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: Running job: job_local_0001
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 1
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> INFO: io.sort.mb = 100
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> INFO: data buffer = 79691776/99614720
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
> INFO: record buffer = 262144/327680
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Max Ngram size is 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Emit Unitgrams is false
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0001
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.common.StringTuple
>         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: map 0% reduce 0%
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: Job complete: job_local_0001
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: Running job: job_local_0002
> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0002
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: map 0% reduce 0%
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO: Job complete: job_local_0002
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Program took 3064 ms
>
>> Using the latest code from trunk, I was able to run the following
>> sequence of commands on the data available after running
>> ./examples/bin/build-reuters.sh
>>
>> (All run from the mahout toplevel directory)
>>
>> ./bin/mahout seqdirectory \
>> -i ./examples/bin/work/reuters-out/ \
>> -o ./examples/bin/work/reuters-out-seqdir \
>> -c UTF-8 -chunk 5
>>
>> ./bin/mahout seq2sparse \
>> -i ./examples/bin/work/reuters-out-seqdir/ \
>> -o ./examples/bin/work/reuters-out-seqdir-sparse
>>
>> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>> -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
>> -o ./examples/bin/work/reuters-colloc \
>> -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>
>> ./bin/mahout seqdumper -s ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less
>>
>> This produces output like:
>>
>> Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable
>> Key: 0 0 25: Value: 18.436118042416638
>> Key: 0 0 zen: Value: 39.36827993847055
>>
>> Where the key is the trigram and the value is the LLR score.
>>
>> If there are multiple parts in examples/bin/work/reuters-colloc/ngrams,
>> you'll need to concatenate them, e.g.:
>>
>> ./bin/mahout seqdumper -s ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
>> ./bin/mahout seqdumper -s ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out
>>
>> Running the results through 'sort -rn -k 6,6' will give you output
>> sorted by LLR score, descending.
>>
>> HTH,
>>
>> Drew
>>
>> On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <[email protected]> wrote:
>>>
>>> Hi,
>>> I'm new to mahout and tried to research this a bit before encountering
>>> this problem.
>>>
>>> After I generate a sequence file for a directory of text files, I run this:
>>>
>>> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>>
>>> It produces a couple exceptions:
>>> ...
>>> WARNING: job_local_0001
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.common.StringTuple
>>>         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
>>> ...
>>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>         at java.util.ArrayList.get(ArrayList.java:322)
>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>>>
>>> How can I make this work?
>>>
>>> Thanks for any tips,
>>> Darren
>>>
>
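P.S. For anyone wondering what the LLR values in the ngrams output actually measure: the score is Dunning's log-likelihood ratio (G-squared) over a 2x2 contingency table of n-gram counts. Mahout computes it in its LogLikelihood utility class; the sketch below is an illustrative Python re-derivation of the same statistic, not Mahout's code, and the tiny corpus, helper names, and edge-count approximation are invented for the example:

```python
from collections import Counter
from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def llr(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 table: k11 = count(a b), k12 = count(a, not b),
    k21 = count(not a, b), k22 = everything else."""
    n = k11 + k12 + k21 + k22
    rows = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    cols = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    return max(0.0, 2.0 * (x_log_x(n) + cells - rows - cols))

def score_bigrams(tokens, min_support=2):
    """Score adjacent bigrams by LLR, most 'phrase-like' first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens) - 1  # total bigram events
    scored = []
    for (a, b), k11 in bigrams.items():
        if k11 < min_support:  # mirrors the driver's -minSupport cutoff
            continue
        k12 = unigrams[a] - k11  # 'a' not followed by 'b' (approximate at edges)
        k21 = unigrams[b] - k11  # 'b' not preceded by 'a' (approximate at edges)
        k22 = n - k11 - k12 - k21
        scored.append(((a, b), llr(k11, k12, k21, k22)))
    # like piping the seqdumper output through `sort -rn` on the score field
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Tokens that co-occur no more often than chance score near zero, while pairs that strongly predict each other score high, which is why a -minLLR threshold (Mahout's default is 1.0) filters out noise n-grams.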
