Hi Peter,

Apologies for the delay in following up on this.
The error you're seeing is a result of the output of the lucene.vector task being incompatible with the input of CollocDriver; each writes and reads sequence files containing different types of data.

The sequence files produced by the lucene.vector task have a key type of LongWritable and a value type of VectorWritable. The vectors themselves do not retain references to word positions in the original document; they merely encode each word as an id from the dictionary generated as part of the process, along with a weight based on the occurrences of the term in the index and the parameters you specified when you called lucene.vector. Since positional information is not retained, you cannot use these vectors as input to the CollocDriver code, which relies on word proximity to form collocations.

The CollocDriver expects the sequence files it uses as input to have a key type of Text and a value type of StringTuple. Sequence files with a key of Text and a value of Text are also acceptable if the preprocess option is specified. In either case the key is the document id, while the value is the text of a document, either tokenized in StringTuple form or untokenized in Text form.

To produce sequence files suitable for generating collocations from a Lucene index, you'll need to write some code to pull the text from a stored field, or to reconstruct the text from a term vector with positional information. You can then write this to a sequence file that will work with CollocDriver. The org.apache.mahout.utils.vectors.lucene.Driver class is a good starting point for learning how to extract data from a Lucene index and write data to sequence files.

Drew

On Fri, Jun 10, 2011 at 5:03 PM, Peter Andrews <[email protected]> wrote:
> Hi,
>
> I just started using Mahout a week or two ago and so far it's been pretty
> good.
> I am working on some term collocation, and while I have been working from
> a directory of files, I want to switch to using Lucene indexes, as that is
> the format the files are already in. I am trying to use lucene.vector to
> turn the indexes into vectors and then use
> org.apache.mahout.vectorizer.collocations.llr.CollocDriver to generate the
> collocations and LLRs. I keep getting this error when I run CollocDriver;
> any ideas?
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> cast to org.apache.hadoop.io.Text
>     at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:40)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:253)
>
> --
> Peter Andrews
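For anyone following along, the stored-field approach Drew describes might look roughly like the sketch below. This is only an illustration, not tested code: it assumes the Lucene 3.x and Hadoop APIs of this era, and the field names "id" and "body" are placeholders for whatever stored fields your index actually contains (both must have been stored at index time, or document() will return null for them).

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

/**
 * Sketch: dump stored text from a Lucene index into a sequence file
 * with a Text key (document id) and Text value (document text), the
 * form CollocDriver accepts when the preprocess option is specified.
 *
 * args[0] = path to the Lucene index directory
 * args[1] = output path for the sequence file
 */
public class LuceneToSequenceFile {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, Text.class);
    try {
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) {
          continue;  // skip deleted documents
        }
        Document doc = reader.document(i);
        String id = doc.get("id");      // placeholder: your stored id field
        String body = doc.get("body");  // placeholder: your stored text field
        if (id != null && body != null) {
          writer.append(new Text(id), new Text(body));
        }
      }
    } finally {
      writer.close();
      reader.close();
    }
  }
}
```

You could then point CollocDriver at the resulting output directory, remembering to enable its preprocess option since the values are untokenized Text rather than StringTuple. If your text fields are not stored but you indexed term vectors with positions, you'd instead reconstruct the text from the term vector as Drew mentions, which takes a bit more work.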
