On Jun 7, 2013, at 1:04 AM, Suneel Marthi <[email protected]> wrote:

> Grant,
> 
> Thinking out loud here: in light of the fix for MAHOUT-944 (the lucene2seq
> utility), which has been committed to trunk, do we still need to maintain
> lucene.vector?
> 
> The path then would be lucene2seq -> seq2sparse -> rowid -> cvb.

lucene.vector will still give you higher performance at the cost of extra 
storage (and the fact that it doesn't work in M/R and can't handle multiple 
directories).

I'd say we keep it for now.


>
> ________________________________
> From: Grant Ingersoll <[email protected]>
> To: [email protected]; James Forth <[email protected]> 
> Sent: Wednesday, June 5, 2013 10:46 AM
> Subject: Re: Dictionary file format in Lucene-Mahout integration
> 
> 
> {code}
> File dictOutFile = new File(dictOut);
> log.info("Dictionary Output file: {}", dictOutFile);
> Writer writer = Files.newWriter(dictOutFile, Charsets.UTF_8);
> DelimitedTermInfoWriter tiWriter =
>     new DelimitedTermInfoWriter(writer, delimiter, field);
> try {
>   tiWriter.write(termInfo);
> } finally {
>   Closeables.closeQuietly(tiWriter);
> }
> {code}
> 
> The code above, in the Lucene Driver class, is the culprit.  The way to fix
> this would be to abstract the writer and allow it to use other
> implementations, namely one that supports the seq2sparse format.
> 
> Any chance you are up for patching it, James?
> 
> -Grant
> 
> On Jun 5, 2013, at 2:00 AM, James Forth <[email protected]> wrote:
> 
>> Hello,
>> 
>> 
>> I’m wondering if anyone can help with a question about the dictionary format
>> in lucene.vector-cvb integration.  I’ve previously used the pathway from text
>> files:  seqdirectory > seq2sparse > rowid > cvb  and it works fine.  The
>> dictionary created by seq2sparse is in sequence file format, and this is
>> accepted by cvb.
>> 
>> But when using a pathway from a lucene index:  lucene.vector > cvb  there is 
>> a problem with cvb throwing the error “dict.out not a SequenceFile”. 
>> Lucene.vector appears to generate a dictionary in plain text format, but cvb
>> requires it in sequence file format.
>> 
>> Does anyone know how to use lucene.vector with cvb, which I assume means
>> obtaining a dictionary as a sequence file from lucene.vector?
>> 
>> Thanks for your help.
>> 
>> James
> 
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
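
For reference, the writer abstraction suggested in the quoted June 5 message could look roughly like the sketch below. All names here (DictionaryWriter, TermEntry, DelimitedDictionaryWriter, KeyValueDictionaryWriter, writeDictionary) are hypothetical and not part of Mahout. The key/value implementation only collects pairs in memory as a stand-in for a real Hadoop SequenceFile writer, and the delimited line layout is illustrative rather than the exact DelimitedTermInfoWriter format.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DictionaryWriterSketch {

  /** Hypothetical value type: one dictionary entry. */
  static final class TermEntry {
    final String term;
    final int docFreq;
    final int index;

    TermEntry(String term, int docFreq, int index) {
      this.term = term;
      this.docFreq = docFreq;
      this.index = index;
    }
  }

  /** Hypothetical abstraction the driver would depend on instead of one concrete writer. */
  interface DictionaryWriter extends Closeable {
    void write(Iterable<TermEntry> entries) throws IOException;
  }

  /** Plain-text implementation: one delimited line per term (illustrative layout). */
  static final class DelimitedDictionaryWriter implements DictionaryWriter {
    private final Writer out;
    private final String delimiter;

    DelimitedDictionaryWriter(Writer out, String delimiter) {
      this.out = out;
      this.delimiter = delimiter;
    }

    @Override
    public void write(Iterable<TermEntry> entries) throws IOException {
      for (TermEntry e : entries) {
        out.write(e.term + delimiter + e.docFreq + delimiter + e.index + "\n");
      }
    }

    @Override
    public void close() throws IOException {
      out.close();
    }
  }

  /**
   * Stand-in for a sequence-file-backed implementation. It keeps (term, index)
   * pairs in memory; a real version would append them to a Hadoop SequenceFile.
   */
  static final class KeyValueDictionaryWriter implements DictionaryWriter {
    final List<String> keys = new ArrayList<>();
    final List<Integer> values = new ArrayList<>();

    @Override
    public void write(Iterable<TermEntry> entries) {
      for (TermEntry e : entries) {
        keys.add(e.term);
        values.add(e.index);
      }
    }

    @Override
    public void close() {
      // Nothing to release for the in-memory stand-in.
    }
  }

  /** Driver-side code sees only the interface, so the output format is swappable. */
  static void writeDictionary(DictionaryWriter writer, Iterable<TermEntry> entries)
      throws IOException {
    try (DictionaryWriter w = writer) {
      w.write(entries);
    }
  }

  public static void main(String[] args) throws IOException {
    List<TermEntry> entries =
        Arrays.asList(new TermEntry("apple", 3, 0), new TermEntry("banana", 1, 1));

    StringWriter text = new StringWriter();
    writeDictionary(new DelimitedDictionaryWriter(text, "\t"), entries);
    System.out.print(text);

    KeyValueDictionaryWriter kv = new KeyValueDictionaryWriter();
    writeDictionary(kv, entries);
    System.out.println(kv.keys + " -> " + kv.values);
  }
}
```

With something along these lines, the driver could choose the implementation from a command-line option, and the key/value variant, once backed by a real SequenceFile, would presumably produce a dictionary that cvb accepts.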




