On Nov 5, 2011, at 7:06 AM, Grant Ingersoll wrote:

> On Nov 5, 2011, at 8:36 AM, Robert Stewart wrote:
>
>> If I run Mahout clustering on Lucene vectors, how would I go about getting
>> that cluster information back into Lucene, in order to use the cluster
>> identifiers in field collapsing?
>
> Since Lucene doesn't have incremental field update (which is seriously
> non-trivial to do in an inverted index), the only way to do this is to
> re-index. Once DocValues are updateable, this may be a lot easier. You
> could also perhaps use the ParallelReader, but that has some restrictions
> (you have to keep docids in sync).
>
>> I know I can re-index with the new cluster info, but is there any way to put
>> cluster info into an existing index (which also may be non-optimized and
>> quite large)? One way may be to have a custom field collapsing component
>> that can read Mahout cluster output. Any thoughts?
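If re-indexing is the route taken, the glue step is joining Mahout's cluster assignments back to documents by ID. A minimal sketch of that join in plain Java, assuming the assignments have first been exported to text as one `docId<TAB>clusterId` per line (e.g. via Mahout's clusterdump/seqdumper tools; the class and field names here are illustrative, not from either project):

```java
import java.util.*;

public class ClusterAssignments {

    // Parse lines of the form "docId<TAB>clusterId" into a map that a
    // re-indexing job can consult when rebuilding each document.
    // Malformed lines are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> byDoc = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                byDoc.put(parts[0].trim(), parts[1].trim());
            }
        }
        return byDoc;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList("doc1\tcluster-7", "doc2\tcluster-3");
        Map<String, String> byDoc = parse(dump);
        // During re-indexing, look up each document's cluster here and add
        // it as an indexed, non-tokenized field (e.g. a StringField named
        // "cluster"), which field collapsing can then group on.
        System.out.println(byDoc.get("doc1")); // cluster-7
    }
}
```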
Two thoughts on this...

1. Normally, for indexes that include clustering, we re-generate the complete
   Solr index using a Hadoop-based workflow that includes all of the
   processing/machine learning. One reason is that there's so much tweaking
   needed to get good results that you often wind up rebuilding everything
   rather than trying to do incremental updates.

2. You could potentially put the data into external fields, but then it
   would need to be used via a FunctionQuery.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
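For reference, the external-fields option above maps to Solr's ExternalFileField. A rough sketch of the schema.xml wiring, assuming the cluster IDs can be encoded as numbers (field and type names here are just illustrative):

```xml
<!-- schema.xml: values come from a file alongside the index, not from the
     index itself, so they can be swapped out without re-indexing.
     Values must be numeric. -->
<fieldType name="externalCluster" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="clusterId" type="externalCluster"/>
```

The values live in a file named `external_clusterId` in the index data directory, one `docKey=value` per line, and are typically reloaded when a new searcher is opened. The catch Ken notes applies: such a field is only usable through function queries (e.g. `{!func}clusterId`), not directly for filtering or field collapsing.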
