Hi Peter,

Apologies for the delay in following up on this.
The error you're seeing is a result of the output of the lucene.vector task being incompatible with the input of CollocDriver; each writes and reads sequence files containing different types of data.

The sequence files produced by the lucene.vector task have a key type of LongWritable and a value type of VectorWritable. The vectors themselves do not retain references to word positions in the original document; they merely encode each word as an id from the dictionary generated as part of the process, along with a weight based on the occurrences of the term in the index and the parameters you specified when you called lucene.vector. Since positional information is not retained, you cannot use these vectors as input to the CollocDriver code, which relies on word proximity to form collocations.

The CollocDriver expects the sequence files it uses as input to have a key type of Text and a value type of StringTuple. Sequence files with a key of Text and a value of Text are also acceptable if the preprocess option is specified. In either case the key is the document id, while the value is the text of a document, either tokenized in StringTuple form or untokenized in Text form.

To produce sequence files suitable for generating collocations from a Lucene index, you'll need to write some code to pull the text from a stored field, or to reconstruct the text from a term vector with positional information. You can then write this to a sequence file that will work with CollocDriver. The org.apache.mahout.utils.vectors.lucene.Driver class is a good starting point for learning how to extract data from a Lucene index and write data to sequence files.

Drew

On Fri, Jun 10, 2011 at 5:03 PM, Peter Andrews <[email protected]> wrote:
> Hi,
>
> I just started using Mahout a week or two ago and so far it's been pretty
> good.
> I am working on some term collocation, and while I have been working from
> a directory of files, I want to switch to using Lucene indexes, as that is
> the format the files are already in. I am trying to use lucene.vector to
> turn the indexes into vectors and then use
> org.apache.mahout.vectorizer.collocations.llr.CollocDriver to generate the
> collocations and LLRs. I keep getting this error when I run CollocDriver;
> any ideas?
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> cast to org.apache.hadoop.io.Text
>     at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:40)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:253)
>
> --
> Peter Andrews
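For anyone following along, the stored-field approach Drew describes might look roughly like the sketch below. This is only an illustration, not tested code: it assumes the Lucene 3.x and Hadoop APIs of this era, and the field names "id" and "body" are placeholders for whatever stored fields your index actually contains (both must have been stored at index time, or document() will return null for them).

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

/**
 * Sketch: dump stored text from a Lucene index into a sequence file
 * with a Text key (document id) and Text value (document text), the
 * form CollocDriver accepts when the preprocess option is specified.
 *
 * args[0] = path to the Lucene index directory
 * args[1] = output path for the sequence file
 */
public class LuceneToSequenceFile {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, Text.class);
    try {
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) {
          continue;  // skip deleted documents
        }
        Document doc = reader.document(i);
        String id = doc.get("id");      // placeholder: your stored id field
        String body = doc.get("body");  // placeholder: your stored text field
        if (id != null && body != null) {
          writer.append(new Text(id), new Text(body));
        }
      }
    } finally {
      writer.close();
      reader.close();
    }
  }
}
```

You could then point CollocDriver at the resulting output directory, remembering to enable its preprocess option since the values are untokenized Text rather than StringTuple. If your text fields are not stored but you indexed term vectors with positions, you'd instead reconstruct the text from the term vector as Drew mentions, which takes a bit more work.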
