Re: mahout lucene.vector from multiple solrcloud "index" directories for kmeans

Ted Dunning Wed, 01 May 2013 23:35:16 -0700

Well done and well described.   

SolrCloud is a bit new, but the need you expressed is a real one that will
appear again.



On May 1, 2013, at 15:47, Sebastián Ramírez <[email protected]> 
wrote:

> Well, I found a simple (maybe dirty) solution for my problem.
> 
> I'm writing it here for the record, since I understand these emails get
> archived and are accessible through the web.
> 
> ---
> 
> The solution is to merge the indices of the SolrCores that make up the
> SolrCloud system into one index, and then create the vector as usual
> from that big merged index.
> 
> This solution may be suboptimal: if you have several really big indices
> that you just can't merge, you are out of luck. But if you can afford to
> merge the needed indices just to create the Mahout vector, then this can
> work for you.
> 
> 
> You just need to do the following (I tested it with Solr 4.2.1).
> 
> You need two files: "lucene-core-VERSION.jar" and "lucene-misc-VERSION.jar",
> where "VERSION" is your Lucene/Solr version (for example,
> "lucene-core-4.2.1.jar"). Those files are in the Solr directory, under
> "./example/solr-webapp/webapp/WEB-INF/lib/".
> 
> You can go to that directory with
> "cd example/solr-webapp/webapp/WEB-INF/lib/".
> 
> Then execute the following:
> 
> java -cp lucene-core-VERSION.jar:lucene-misc-VERSION.jar \
>   org/apache/lucene/misc/IndexMergeTool \
>   /path/to/newindex \
>   /path/to/index1 \
>   /path/to/index2
> 
> 
> Replace "VERSION" with your Lucene/Solr version, and "/path/to/index1" and
> "/path/to/index2" with the paths to your real indices.
> 
> If you are using the Solr instance in the "example" directory, the index
> path would be "./example/solr/collection1/data/index/".
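
The merge step quoted above can be wrapped in a small shell function. This is only a sketch under stated assumptions, not part of the original message: the names LUCENE_VERSION, LIB_DIR, DRY_RUN and merge_indices are made up for illustration, and the class name is written in the dotted form the java launcher conventionally expects. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch: merge several shard indices into one with Lucene's IndexMergeTool.
# LUCENE_VERSION, LIB_DIR and DRY_RUN are illustrative names (assumptions),
# not something the original message defines.

LUCENE_VERSION="${LUCENE_VERSION:-4.2.1}"
LIB_DIR="${LIB_DIR:-example/solr-webapp/webapp/WEB-INF/lib}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = really run it

merge_indices() {
    # usage: merge_indices NEW_INDEX INDEX1 INDEX2 [INDEX3 ...]
    # The tool needs a target directory plus at least two source indices.
    if [ "$#" -lt 3 ]; then
        echo "usage: merge_indices NEW_INDEX INDEX1 INDEX2 [...]" >&2
        return 1
    fi
    classpath="$LIB_DIR/lucene-core-$LUCENE_VERSION.jar:$LIB_DIR/lucene-misc-$LUCENE_VERSION.jar"
    if [ "$DRY_RUN" = "1" ]; then
        # Print the command instead of executing it.
        echo java -cp "$classpath" org.apache.lucene.misc.IndexMergeTool "$@"
    else
        java -cp "$classpath" org.apache.lucene.misc.IndexMergeTool "$@"
    fi
}

# Placeholder paths, as in the quoted message.
merge_indices /path/to/newindex /path/to/index1 /path/to/index2
```

Setting DRY_RUN=0 runs the merge for real. It is probably safest to do that only while nothing is writing to the source indices, and to merge into a new, empty directory.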
> 
> Your newly merged index will be in "/path/to/newindex", and from there
> you can create your Mahout vector.
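
Creating the vector from the merged index and clustering it could then look roughly like the sketch below. Everything specific in it is an assumption rather than something stated in this thread: a "mahout" launcher on the PATH, a Solr field called "text" that was indexed with term vectors, the work directory layout, and the kmeans settings (-k, -x). The flag names are the ones I recall from the Mahout documentation of that era, so check them against "mahout lucene.vector --help" and "mahout kmeans --help" before relying on them. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: vectorize the merged index with lucene.vector, then run kmeans.
# MERGED_INDEX, WORK_DIR, FIELD and DRY_RUN are illustrative names (assumptions).

MERGED_INDEX="${MERGED_INDEX:-/path/to/newindex}"
WORK_DIR="${WORK_DIR:-/path/to/mahout-work}"
FIELD="${FIELD:-text}"     # hypothetical field name; it must store term vectors
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = really run them

run() {
    # Either print the command line or execute it, depending on DRY_RUN.
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

vectorize_index() {
    # Reads one Lucene index directory, writes vectors and a text dictionary.
    run mahout lucene.vector \
        --dir "$MERGED_INDEX" \
        --field "$FIELD" \
        --output "$WORK_DIR/vectors" \
        --dictOut "$WORK_DIR/dict.txt"
}

cluster_vectors() {
    # usage: cluster_vectors K
    # -k samples K initial centroids into the -c path; -x caps the iterations;
    # -cl also assigns each input vector to a cluster after the final iteration.
    run mahout kmeans \
        -i "$WORK_DIR/vectors" \
        -c "$WORK_DIR/initial-clusters" \
        -o "$WORK_DIR/clusters" \
        -k "$1" -x 20 -cl
}

vectorize_index
cluster_vectors 10
```

Because only a single index directory is passed to lucene.vector, this sidesteps the one-directory limitation described in the original question further down.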
> 
> I made my solution based on this article:
> http://docs.lucidworks.com/display/solr/Merging+Indexes
> 
> 
> I hope this helps somebody someday too.
> 
> Best Regards,
> 
> Sebastián Ramírez
> 
> 
> 
> On Mon, Apr 22, 2013 at 9:35 PM, Sebastian Ramirez <
> [email protected]> wrote:
> 
>> Hello everyone,
>> 
>> I want to know if it's possible to do a clustering of documents in
>> SolrCloud indices (multiple "index" directories) and how would one
>> accomplish that.
>> 
>> ---
>> 
>> I'm using Solr 4.2.1 and Mahout 0.8-SNAPSHOT
>> 
>> I can cluster documents from one Lucene/Solr index. I can even cluster
>> documents from a Solr 4.x index (the same version that implements the
>> distributed SolrCloud).
>> 
>> As far as I know, SolrCloud treats the indices distributed across "shards"
>> as one big index.
>> 
>> The problem is that although I can cluster documents from one index, from
>> one "shard"/"SolrCore", I can't cluster the documents from the whole index.
>> ...Or at least, I don't know how to do it.
>> 
>> I used Mahout's lucene.vector tool, which takes one index directory and
>> outputs a "vector" file (if I'm not wrong) and a text "dictionary". Then I
>> can use Mahout with, for example, kmeans to cluster the "documents".
>> 
>> The problem is that I can only pass one index directory as an argument to
>> lucene.vector, and if I had two "SolrCores"/"shards" I would have two
>> "index" directories.
>> 
>> I can even cluster the data that happened to be in one of those "index"
>> directories, but not all the data in both (the complete index).
>> I tried passing the two directories to lucene.vector. I also tried
>> creating both vectors and passing the directory containing them to kmeans
>> instead of passing the vector file directly, but I always got an error.
>> 
>> I don't know if it's possible to "merge" two vectors, to extract a vector
>> from the whole distributed index in some way, or to "export" the indices in
>> some format that can then be converted to a format Mahout supports.
>> Anything that can be done may help. Is there anything that can be done?
>> 
>> 
>> 
>> 
>> I'm really a newbie with Mahout and Solr, and I know that some of the
>> things I wrote will sound as silly as a newbie often sounds...
>> 
>> So, many thanks for your patience and help! :)
>> 
>> 
>> Sebastián Ramírez
>> 
>> 
> 
