Well done and well described. SolrCloud is a bit new, but the need you expressed is a real one that will appear again.

On May 1, 2013, at 15:47, Sebastián Ramírez <[email protected]> wrote:

> Well, I found a simple (maybe dirty) solution for my problem.
>
> I write it here just for the record, as I understand these emails get
> archived and can be accessed through the web.
>
> ---
>
> The solution is to merge the indices of the needed SolrCores that form the
> SolrCloud system into one index, and then create the vector as usual
> from that big merged index.
>
> This solution may be suboptimal: if you have several really big
> indices that you just can't merge, you are out of luck. But if you can
> afford merging the needed indices just to create the Mahout vector, then
> this can work for you.
>
> You just need to do the following; I tested it using Solr 4.2.1.
>
> You need two files: "lucene-core-VERSION.jar" and "lucene-misc-VERSION.jar"
> (where "VERSION" is your Lucene/Solr version, for example
> "lucene-core-4.2.1.jar"). Those files are in the Solr directory, under
> "./example/solr-webapp/webapp/WEB-INF/lib/".
>
> You can go to that directory with
> "cd example/solr-webapp/webapp/WEB-INF/lib/".
>
> Then execute the following:
>
>     java -cp lucene-core-VERSION.jar:lucene-misc-VERSION.jar \
>         org/apache/lucene/misc/IndexMergeTool \
>         /path/to/newindex \
>         /path/to/index1 \
>         /path/to/index2
>
> Replace "VERSION" with your Lucene/Solr version and the "/path/to/index"
> entries with your real index paths.
>
> If you are using the Solr instance in the "example" directory, the index
> path would be "./example/solr/collection1/data/index/".
>
> Your newly merged index will be in "/path/to/newindex", from which
> you can create your Mahout vector.
>
> I based my solution on this article:
> http://docs.lucidworks.com/display/solr/Merging+Indexes
>
> I hope this helps somebody someday too.
>
> Best Regards,
>
> Sebastián Ramírez
>
>
> On Mon, Apr 22, 2013 at 9:35 PM, Sebastian Ramirez <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I want to know if it's possible to cluster documents in
>> SolrCloud indices (multiple "index" directories) and how one would
>> accomplish that.
>>
>> ---
>>
>> I'm using Solr 4.2.1 and Mahout 0.8-SNAPSHOT.
>>
>> I can cluster documents from one Lucene/Solr index. I can even cluster
>> documents from a Solr 4.x index (the same version that implements the
>> distributed SolrCloud).
>>
>> As far as I know, SolrCloud uses indices distributed in files across
>> "shards" as one big index.
>>
>> The problem is that although I can cluster documents from one index, from
>> one "shard"/"SolrCore", I can't cluster the documents from the whole index.
>> ...Or at least, I don't know how to do it.
>>
>> I used Mahout with the lucene.vector tool. It takes one index directory and
>> outputs a "vector" file (if I'm not wrong) and a text "dictionary". Then I
>> can use Mahout with, for example, kmeans to cluster the "documents".
>>
>> The problem is that I can only pass one index directory as an argument to
>> lucene.vector, and if I had two "SolrCores"/"shards" I would have two
>> "index" directories.
>>
>> I can even cluster the data that happened to be in one of those "index"
>> directories, but not all the data in both (the complete index).
>> I tried to pass the two directories to lucene.vector. I also tried to
>> create both vectors and pass the directory containing them to kmeans
>> instead of passing the vector file directly ...but I always got an error.
>>
>> I don't know if it's possible to "merge" two vectors, or to extract in some
>> way a vector from the whole distributed index, or to "export" the indices in
>> some format that can then be converted to a format Mahout supports...
>> Whatever can be done may help... Is there anything that can be done?
>>
>> I'm really a newbie with Mahout and Solr, and I know that some of the
>> things I wrote will sound as silly as a newbie often sounds...
>>
>> So, many thanks for your patience and help! :)
>>
>> Sebastián Ramírez
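
For anyone who wants to script the merge step described in the quoted message, here is a minimal Python sketch. It only assembles and runs the same "java -cp ... IndexMergeTool" command line. The function name build_merge_command and the script's argument order are hypothetical, chosen for illustration; the jar names and the IndexMergeTool class come from the message itself, with the class written in the usual dotted form. It assumes "java" is on the PATH and that both jars sit in the same directory.

```python
import os
import subprocess
import sys


def build_merge_command(lib_dir, version, merged_index, source_indices):
    """Return the argument list for running Lucene's IndexMergeTool.

    lib_dir        -- directory holding the lucene-core and lucene-misc jars
    version        -- Lucene/Solr version string, e.g. "4.2.1"
    merged_index   -- directory where the merged index will be written
    source_indices -- the "index" directories to merge (two or more)
    """
    if len(source_indices) < 2:
        raise ValueError("need at least two source indices to merge")
    jars = [
        os.path.join(lib_dir, "lucene-core-%s.jar" % version),
        os.path.join(lib_dir, "lucene-misc-%s.jar" % version),
    ]
    # os.pathsep is ":" on Unix-like systems and ";" on Windows.
    return [
        "java",
        "-cp", os.pathsep.join(jars),
        "org.apache.lucene.misc.IndexMergeTool",
        merged_index,
    ] + list(source_indices)


if __name__ == "__main__":
    # Hypothetical usage:
    #   python merge_indices.py LIB_DIR VERSION NEW_INDEX INDEX1 INDEX2 [...]
    if len(sys.argv) < 6:
        sys.exit("usage: merge_indices.py LIB_DIR VERSION NEW_INDEX "
                 "INDEX1 INDEX2 [INDEX3 ...]")
    lib_dir, version, merged_index = sys.argv[1:4]
    command = build_merge_command(lib_dir, version, merged_index, sys.argv[4:])
    # Raises CalledProcessError if the java process exits with an error.
    subprocess.check_call(command)
```

The merged directory can then be passed to lucene.vector as the single index directory, as described in the quoted message. It is probably safest to run the merge against copies of the shard indices, or while Solr is not writing to them.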
