Re: Mahout clustering from lucene index

Ankit Goel Sat, 10 Oct 2015 09:09:06 -0700

Hi,
In kmeans we need to specify the number of clusters and a directory of initial 
cluster vectors. When you want random initial vectors, specify k (e.g. -k 5) 
together with the directory for the initial vectors; in this case that 
directory is where the randomly chosen vectors will be saved, not read from. 
I specify it with -c ./cluster-directory/initial (that's just my preference). 
You can specify any location.
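
As a rough sketch, the full call could look like the one below. The paths 
are only placeholders I made up for illustration (I am assuming your 
lucene.vector output went to ./vectors/out.vec), and the iteration count and 
distance measure are just example choices, so adjust them to your setup.

```shell
# Hypothetical paths: change these to match your own layout.
VECTORS=./vectors/out.vec             # assumed output file of lucene.vector
INITIAL=./cluster-directory/initial   # random initial vectors get written here
OUTPUT=./cluster-directory/output     # clustering results

# -k 5 : pick 5 random input vectors as initial centroids, saved under -c
# -x 20: stop after at most 20 iterations (example value)
# -dm  : distance measure (cosine is an example choice for text vectors)
# -cl  : also assign the input vectors to clusters after the iterations
# -ow  : overwrite the output directory if it already exists
mahout kmeans \
  -i "$VECTORS" \
  -c "$INITIAL" \
  -o "$OUTPUT" \
  -k 5 \
  -x 20 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cl -ow
```

Because -k is given, whatever is already in the -c directory is replaced by 
the newly sampled vectors. Leave -k out only if you have prepared the 
initial clusters yourself.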


> On 10-Oct-2015, at 7:47 pm, Cristian Barrientos Montoya <cs3...@gmail.com> 
> wrote:
> 
> Hi there,
> I've been trying to run kmeans clustering on a lucene index, after creating
> the vectors with the command tool "lucene.vector", but the kmeans algorithm
> also needs a clusters input "-c", and I don't know where or how to get these.
> Would you give me some advice or another way to do the kmeans clustering?
> 
> My case scenario is:
> A lot of resources gotten from apache nutch. The resources are on apache
> solr (v 5.2), so I exported them to a json file to create an index on lucene
> (v 4.6). The resources are something like:
> 
> {
> "title": "Title #1",
> "summary": "summary of the resource",
> "url": "www.urlresources.com/resourceId.jpg",
> "description": "Some description",
> "extension": "jpg",
> "subject": "Subject of the resource",
> "area": "resource area"
> }
> 
> This is how I am indexing to lucene:
> https://gist.github.com/ColadaFF/1d6557ebaa147753bc9f
> 
> And the way I am generating vectors is the same as the example on the
> mahout page:
> https://mahout.apache.org/users/basics/creating-vectors-from-text.html
> 
> Am I in the right direction or should I use classification?
> 
> I'm also reading some resources, but none of them say what to do with
> the lucene vectors, so any resource you can give would be great as well.
> 
> Thanks all of you!
