On Nov 5, 2011, at 7:06 AM, Grant Ingersoll wrote:

> On Nov 5, 2011, at 8:36 AM, Robert Stewart wrote:
>
>> If I run Mahout clustering on Lucene vectors, how would I go about getting
>> that cluster information back into Lucene, in order to use the cluster
>> identifiers in field collapsing?
>
> Since Lucene doesn't have incremental field update (which is seriously
> non-trivial to do in an inverted index), the only way to do this is to
> re-index. Once DocValues are updateable, this may be a lot easier. You
> could also perhaps use the ParallelReader, but that has some restrictions
> (you have to keep docids in sync).
>
>> I know I can re-index with the new cluster info, but is there any way to put
>> cluster info into an existing index (which also may be non-optimized and
>> quite large)? One way may be to have a custom field collapsing component
>> that can read Mahout cluster output. Any thoughts?
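If re-indexing is the route taken, the glue step is joining Mahout's cluster assignments back to documents by ID. A minimal sketch of that join in plain Java, assuming the assignments have first been exported to text as one `docId<TAB>clusterId` per line (e.g. via Mahout's clusterdump/seqdumper tools; the class and field names here are illustrative, not from either project):

```java
import java.util.*;

public class ClusterAssignments {

    // Parse lines of the form "docId<TAB>clusterId" into a map that a
    // re-indexing job can consult when rebuilding each document.
    // Malformed lines are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> byDoc = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                byDoc.put(parts[0].trim(), parts[1].trim());
            }
        }
        return byDoc;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList("doc1\tcluster-7", "doc2\tcluster-3");
        Map<String, String> byDoc = parse(dump);
        // During re-indexing, look up each document's cluster here and add
        // it as an indexed, non-tokenized field (e.g. a StringField named
        // "cluster"), which field collapsing can then group on.
        System.out.println(byDoc.get("doc1")); // cluster-7
    }
}
```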
Two thoughts on this...

1. Normally, for indexes that include clustering, we re-generate the complete
   Solr index using a Hadoop-based workflow that includes all of the
   processing/machine learning. One reason is that there's so much tweaking
   needed to get good results that you often wind up rebuilding everything
   rather than trying to do incremental updates.

2. You could potentially put the data into external fields, but then it
   would need to be used via a FunctionQuery.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
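For reference, the external-fields option above maps to Solr's ExternalFileField. A rough sketch of the schema.xml wiring, assuming the cluster IDs can be encoded as numbers (field and type names here are just illustrative):

```xml
<!-- schema.xml: values come from a file alongside the index, not from the
     index itself, so they can be swapped out without re-indexing.
     Values must be numeric. -->
<fieldType name="externalCluster" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="clusterId" type="externalCluster"/>
```

The values live in a file named `external_clusterId` in the index data directory, one `docKey=value` per line, and are typically reloaded when a new searcher is opened. The catch Ken notes applies: such a field is only usable through function queries (e.g. `{!func}clusterId`), not directly for filtering or field collapsing.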
