Re: why so hard to get doc->cluster mapping?

Lance Norskog Sat, 21 Apr 2012 16:25:42 -0700

Nothing should require a local or HDFS path. Which job/class is this?

On Fri, Apr 20, 2012 at 3:17 AM, Paritosh Ranjan <[email protected]> wrote:
> I am not sure about this, but I see a text file ( -i output.txt ) as the
> input to kmeans.  KMeans is supposed to take an HDFS path as input.
> Ignore this if it is already an HDFS path.
>
>
> On 19-04-2012 03:23, Jeff Eastman wrote:
>>
>> Are you running seq2sparse in there somewhere? It has a -nv option that
>> will produce NamedVectors in its vector output. These will pass through the
>> clustering and be evident in the clusterdump output.
>>
>> On 4/18/12 3:08 PM, Robert Stewart wrote:
>>>
>>> I am running kmeans clustering on vectors extracted from a lucene index.
>>>
>>> What I want as my end result is a mapping of document ID to the cluster
>>> for each document.  How can I get that output?  I see many other people also
>>> want this, but I don't see enough detail in any posted solution to get it
>>> working.
>>>
>>> So far I do this:
>>>
>>> ./mahout lucene.vector -d ~/clusterdemo/solr/data/index/ -f text
>>> --idField id --output output.txt --dictOut dict.txt
>>>
>>> ./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters
>>> clusters -cl
>>>
>>> ./mahout clusterdump --dictionary dict.txt --seqFileDir
>>> kmeans/clusters-10-final --dictionaryType text --pointsDir
>>> kmeans/clusteredPoints --output dump
>>>
>>> But the "dump" file does not contain any mapping from document ID to
>>> cluster.  How can I get that?  It should not be this hard to get the most
>>> obvious/useful output IMO ;)
>>>
>>> Thanks
>>> Bob
>>>
>>>
>>>
>>
>
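
For reference, the mapping asked for in the quoted thread boils down to pairing each vector's name with its nearest cluster, which is why named vectors (as from the -nv option mentioned above) matter. The following is only a toy sketch in plain Python, not Mahout code: the function name assign_clusters, the document IDs, and the sample vectors and centroids are all hypothetical and made up for illustration.

```python
import math


def assign_clusters(named_vectors, centroids):
    """Map each document ID to the index of its nearest centroid.

    Hypothetical helper for illustration only; it uses Euclidean
    distance and does not reproduce any Mahout API.
    """
    mapping = {}
    for doc_id, vector in named_vectors.items():
        distances = [math.dist(vector, centroid) for centroid in centroids]
        mapping[doc_id] = distances.index(min(distances))
    return mapping


# Made-up sample data: three named document vectors and two centroids.
named_vectors = {
    "doc-1": (0.0, 0.1),
    "doc-2": (5.0, 5.2),
    "doc-3": (0.2, 0.0),
}
centroids = [(0.0, 0.0), (5.0, 5.0)]

# Print one "document ID <tab> cluster index" line per document.
for doc_id, cluster_id in sorted(assign_clusters(named_vectors, centroids).items()):
    print(f"{doc_id}\t{cluster_id}")
```

Without a name attached to each vector, the same assignment still happens, but there is nothing to print in the first column, which matches the symptom described in the original question.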




-- 
Lance Norskog
[email protected]
