On Jun 7, 2013, at 1:04 AM, Suneel Marthi <[email protected]> wrote:
> Grant,
>
> Thinking out loud here: in light of the fix for MAHOUT-944 (the lucene2seq utility),
> which has been committed to trunk, do we still need to maintain lucene.vector?
>
> The path then would be lucene2seq -> seq2sparse -> rowid -> cvb.

lucene.vector will still give you higher performance, at the cost of extra storage (and with the limitations that it doesn't work in M/R and can't handle multiple directories). I'd say we keep it for now.

> ________________________________
> From: Grant Ingersoll <[email protected]>
> To: [email protected]; James Forth <[email protected]>
> Sent: Wednesday, June 5, 2013 10:46 AM
> Subject: Re: Dictionary file format in Lucene-Mahout integration
>
> {code}
> File dictOutFile = new File(dictOut);
> log.info("Dictionary Output file: {}", dictOutFile);
> Writer writer = Files.newWriter(dictOutFile, Charsets.UTF_8);
> DelimitedTermInfoWriter tiWriter = new DelimitedTermInfoWriter(writer, delimiter, field);
> try {
>   tiWriter.write(termInfo);
> } finally {
>   Closeables.closeQuietly(tiWriter);
> }
> {code}
>
> This is the culprit, in the Lucene Driver class. The way to fix it would be to
> abstract the writer and allow it to use other implementations, namely one
> that supports the seq2sparse format.
>
> Any chance you are up for patching it, James?
>
> -Grant
>
> On Jun 5, 2013, at 2:00 AM, James Forth <[email protected]> wrote:
>
>> Hello,
>>
>> I’m wondering if anyone can help with a question about the dictionary format
>> in lucene.vector-cvb integration. I’ve previously used the pathway from text
>> files: seqdirectory > seq2sparse > rowid > cvb, and it works fine. The
>> dictionary created by seq2sparse is in sequence file format, and this is
>> accepted by cvb.
>>
>> But when using a pathway from a Lucene index: lucene.vector > cvb, there is
>> a problem with cvb throwing the error “dict.out not a SequenceFile”.
>> lucene.vector appears to generate a dictionary in plain text format, but cvb
>> requires it in sequence file format.
>>
>> Does anyone know how to use lucene.vector with cvb, which I assume means
>> obtaining a dictionary as a sequence file from lucene.vector?
>>
>> Thanks for your help.
>>
>> James
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
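
For anyone picking up the patch, the "abstract the writer" idea from the thread could look roughly like the sketch below. It is only an illustration under stated assumptions, not the Mahout code: DictionarySink, DelimitedDictionarySink, InMemoryDictionarySink and DictionaryExporter are hypothetical names, and the "term<delimiter>index" text layout is assumed for the example rather than taken from what lucene.vector actually writes. The in-memory sink merely stands in for a sequence-file implementation; a real one would presumably wrap a Hadoop SequenceFile.Writer and emit the key/value types that cvb expects from seq2sparse.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical abstraction: anything that can receive (term, index) pairs. */
interface DictionarySink extends AutoCloseable {
  void write(String term, int index) throws IOException;

  @Override
  void close() throws IOException;
}

/** Plain-text sink: one "term<delimiter>index" line per entry (assumed layout). */
final class DelimitedDictionarySink implements DictionarySink {
  private final Writer out;
  private final String delimiter;

  DelimitedDictionarySink(Writer out, String delimiter) {
    this.out = out;
    this.delimiter = delimiter;
  }

  @Override
  public void write(String term, int index) throws IOException {
    out.write(term + delimiter + index + "\n");
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}

/** Stand-in for a sequence-file sink: it only collects entries in memory. */
final class InMemoryDictionarySink implements DictionarySink {
  final Map<String, Integer> entries = new LinkedHashMap<>();

  @Override
  public void write(String term, int index) {
    entries.put(term, index);
  }

  @Override
  public void close() {
    // Nothing to release in the in-memory stand-in.
  }
}

/** The driver side depends only on the interface, so the output format is pluggable. */
final class DictionaryExporter {
  private DictionaryExporter() {}

  static void export(Map<String, Integer> dictionary, DictionarySink sink) throws IOException {
    // try-with-resources closes the sink even if a write fails.
    try (DictionarySink s = sink) {
      for (Map.Entry<String, Integer> e : dictionary.entrySet()) {
        s.write(e.getKey(), e.getValue());
      }
    }
  }
}
```

With this shape, the driver would choose the sink from a command-line option and leave the rest of the dictionary-writing code untouched, so the existing plain-text output could stay the default.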
