Ahh, ok. Output from seqdirectory is a SequenceFile<Text,Text>, where
the value is the un-tokenized text of each document. By default the
CollocDriver expects tokenized text as input, but if you add the '-p'
option to the CollocDriver command-line it will tokenize the text
before generating the collocations, so you can use the output of
seqdirectory as is.
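
(For intuition: the ClassCastException in your log happens because the collocation mapper expects each value to already be a StringTuple of tokens, while seqdirectory wrote raw Text. Roughly speaking, '-p' adds a tokenization pass up front. Here's a toy Python sketch of that idea — a lowercase/whitespace tokenizer stands in for the real Lucene analyzer; this is an illustration, not Mahout code:)

```python
def tokenize(text):
    # Stand-in for the configured Lucene analyzer: lowercase + whitespace split.
    return text.lower().split()

def preprocess(documents):
    """Toy analogue of CollocDriver's '-p' pass: turn each document's raw
    text value into a token list (Mahout's StringTuple equivalent) so the
    n-gram mapper downstream never sees un-tokenized Text."""
    return {doc_id: tokenize(text) for doc_id, text in documents.items()}

docs = {"doc1": "New York is in New York State"}
print(preprocess(docs)["doc1"])
# -> ['new', 'york', 'is', 'in', 'new', 'york', 'state']
```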

For example:

./bin/mahout seqdirectory \
 -i ./examples/bin/work/reuters-out/ \
 -o ./examples/bin/work/reuters-out-seqdir \
 -c UTF-8 -chunk 5

./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
  -i ./examples/bin/work/reuters-out-seqdir \
  -o ./examples/bin/work/reuters-colloc-2 \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
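
(Aside: the 'llr' in the class name is Dunning's log-likelihood ratio, which scores how much more often an n-gram's words co-occur than independence would predict. A minimal Python sketch of that statistic for a word pair, from a 2x2 contingency table — an illustration of the idea, not Mahout's code:)

```python
from math import log

def x_log_x(x):
    # Convention: 0 * log(0) == 0
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a word pair (A, B).
    k11: co-occurrences of A and B
    k12: occurrences of A without B
    k21: occurrences of B without A
    k22: events with neither A nor B
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)

# A strongly associated pair scores high; independent counts score ~0.
print(llr(100, 1, 1, 10000))   # large positive score
print(llr(10, 10, 10, 10))     # ~0.0
```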

Drew

On Sun, Jan 23, 2011 at 10:44 AM, Darren Govoni <[email protected]> wrote:
> Hi Drew,
>  Thanks for the tips - much appreciated. See inline.
>
> On 01/23/2011 09:22 AM, Drew Farris wrote:
>>
>> Hi Darren,
>>
>>  From the error message you receive, it is not exactly clear what is
>> happening here. I suppose it could be due to the format of the input
>> sequence file, but I'm not certain.
>>
>> A couple questions that will help me answer your question:
>>
>> 1) What version of Mahout are you using?
>
> 0.4
>>
>> 2) How are you generating the sequence file you are using as input to
>> the CollocDriver?
>
> bin/mahout seqdirectory --charset ascii --input textfiles/ --output out
>
> Then I run:
>
> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
> out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>
> I am not running Hadoop. The error is repeatable. Here is the full output.
> -----------
> no HADOOP_HOME set, running locally
> Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Program took 317 ms
> [darren@cobalt mahout-distribution-0.4]$ bin/mahout
> org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o
> phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
> no HADOOP_HOME set, running locally
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
> WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props
> found on classpath, will use command-line arguments only
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Command line arguments:
> {--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer,
> --endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, --maxRed=2,
> --minLLR=1.0, --minSupport=2, --output=phrases, --startPhase=0,
> --tempDir=temp}
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Maximum n-gram size is: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Minimum Support value: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Minimum LLR value: 1.0
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Number of pass1 reduce tasks: 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Input will NOT be preprocessed
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
> Jan 23, 2011 10:42:56 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 1
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Running job: job_local_0001
> Jan 23, 2011 10:42:56 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 1
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> <init>
> INFO: io.sort.mb = 100
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> <init>
> INFO: data buffer = 79691776/99614720
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> <init>
> INFO: record buffer = 262144/327680
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Max Ngram size is 2
> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Emit Unitgrams is false
> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0001
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.common.StringTuple
>    at
> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO:  map 0% reduce 0%
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Job complete: job_local_0001
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
> INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId=
> - already initialized
> Jan 23, 2011 10:42:57 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Running job: job_local_0002
> Jan 23, 2011 10:42:58 AM
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
> INFO: Total input paths to process : 0
> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0002
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>    at java.util.ArrayList.get(ArrayList.java:322)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO:  map 0% reduce 0%
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
> monitorAndPrintJob
> INFO: Job complete: job_local_0002
> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
> INFO: Counters: 0
> Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Program took 3064 ms
>
>> Using the latest code from trunk, I was able to run the following
>> sequence of commands on the data available after running
>> ./examples/bin/build-reuters.sh
>>
>> (All run from the mahout toplevel directory)
>>
>> ./bin/mahout seqdirectory \
>>   -i ./examples/bin/work/reuters-out/ \
>>   -o ./examples/bin/work/reuters-out-seqdir \
>>   -c UTF-8 -chunk 5
>>
>> ./bin/mahout seq2sparse \
>>   -i ./examples/bin/work/reuters-out-seqdir/ \
>>   -o ./examples/bin/work/reuters-out-seqdir-sparse
>>
>> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>>   -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
>>   -o ./examples/bin/work/reuters-colloc \
>>   -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>
>> ./bin/mahout seqdumper -s \
>>   ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less
>>
>> This produces output like:
>>
>> Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class
>> org.apache.hadoop.io.DoubleWritable
>> Key: 0 0 25: Value: 18.436118042416638
>> Key: 0 0 zen: Value: 39.36827993847055
>>
>> Where the key is the trigram and the value is the llr score.
>>
>> If there are multiple parts in
>> examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
>> them, e.g.:
>>
>> ./bin/mahout seqdumper -s \
>>   ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
>> ./bin/mahout seqdumper -s \
>>   ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out
>>
>> Running the results through 'sort -rn -k 6,6' will give you output
>> sorted by LLR score descending.
>>
>> HTH,
>>
>> Drew
>>
>> On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni<[email protected]>
>>  wrote:
>>>
>>> Hi,
>>>  I'm new to Mahout and tried to research this a bit before encountering
>>> this
>>> problem.
>>>
>>> After I generate a sequence file for a directory of text files, I run this:
>>>
>>>  bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i
>>> out/chunk-0 -o colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng
>>> 3
>>>
>>> It produces a couple exceptions:
>>> ...
>>> WARNING: job_local_0001
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>>> org.apache.mahout.common.StringTuple
>>>    at
>>>
>>> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>    at
>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
>>> monitorAndPrintJob
>>> ...
>>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>>    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>    at java.util.ArrayList.get(ArrayList.java:322)
>>>    at
>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>>>
>>> How can I make this work?
>>>
>>> Thanks for any tips,
>>> Darren
>>>
>
>
