Re: Dictionary file format in Lucene-Mahout integration

Grant Ingersoll Wed, 05 Jun 2013 07:47:43 -0700

{code}
File dictOutFile = new File(dictOut);
log.info("Dictionary Output file: {}", dictOutFile);
Writer writer = Files.newWriter(dictOutFile, Charsets.UTF_8);
DelimitedTermInfoWriter tiWriter = new DelimitedTermInfoWriter(writer, delimiter, field);
try {
  tiWriter.write(termInfo);
} finally {
  Closeables.closeQuietly(tiWriter);
}
{code}


The code above, in the Lucene Driver class, is the culprit: it always writes 
the dictionary as delimited plain text.  The way to fix this would be to 
abstract the writer and allow it to use other implementations, namely one that 
supports the seq2sparse (sequence file) format.
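
As a rough sketch of that abstraction (not a patch): every name below 
(DictionaryWriter, DelimitedDictionaryWriter, KeyValueDictionaryWriter, 
DictionaryWriterDemo) is hypothetical and does not exist in Mahout, the text 
layout is simplified, and the key/value variant is only an in-memory stand-in 
so the example runs without Hadoop.  A real version would presumably append 
term/index pairs through Hadoop's SequenceFile.Writer instead.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface: the driver would depend on this, not on one concrete writer.
interface DictionaryWriter extends Closeable {
  void write(String term, int index) throws IOException;
}

// Plain-text variant (simplified, assumed layout: term, delimiter, index, newline).
class DelimitedDictionaryWriter implements DictionaryWriter {
  private final Writer out;
  private final String delimiter;

  DelimitedDictionaryWriter(Writer out, String delimiter) {
    this.out = out;
    this.delimiter = delimiter;
  }

  @Override
  public void write(String term, int index) throws IOException {
    out.write(term + delimiter + index + "\n");
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}

// Stand-in for a sequence-file variant: it only collects (term, index) pairs in memory.
// Assumption: a real implementation would hand each pair to a SequenceFile writer.
class KeyValueDictionaryWriter implements DictionaryWriter {
  final Map<String, Integer> pairs = new LinkedHashMap<>();

  @Override
  public void write(String term, int index) {
    pairs.put(term, index);
  }

  @Override
  public void close() {
    // Nothing to release for the in-memory stand-in.
  }
}

class DictionaryWriterDemo {
  // The driver-side code is the same whichever implementation is passed in.
  static void dump(Map<String, Integer> dict, DictionaryWriter writer) throws IOException {
    try (DictionaryWriter w = writer) {
      for (Map.Entry<String, Integer> e : dict.entrySet()) {
        w.write(e.getKey(), e.getValue());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Map<String, Integer> dict = new LinkedHashMap<>();
    dict.put("apple", 0);
    dict.put("banana", 1);

    StringWriter text = new StringWriter();
    dump(dict, new DelimitedDictionaryWriter(text, "\t"));
    System.out.print(text);

    KeyValueDictionaryWriter kv = new KeyValueDictionaryWriter();
    dump(dict, kv);
    System.out.println(kv.pairs);
  }
}
```

With something along these lines, the driver could choose the implementation 
from a command-line option and leave the rest of the dump logic unchanged.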

Any chance you are up for patching it, James?

-Grant

On Jun 5, 2013, at 2:00 AM, James Forth <[email protected]> wrote:

> Hello,
> 
> 
> I’m wondering if anyone can help with a question about the dictionary format
> in the lucene.vector-cvb integration.  I’ve previously used the pathway from
> text files:  seqdirectory > seq2sparse > rowid > cvb  and it works fine.  The
> dictionary created by seq2sparse is in sequence file format, and this is
> accepted by cvb.
> 
> But when using a pathway from a lucene index:  lucene.vector > cvb  there is 
> a problem with cvb throwing the error “dict.out not a SequenceFile”. 
> Lucene.vector appears to generate a dictionary in plain text format, but cvb
> requires it in sequence file format.
> 
> Does anyone know how to use lucene.vector with cvb, which I assume means
> obtaining a dictionary as a sequence file from lucene.vector?
> 
> Thanks for your help.
> 
> James

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
