Re: why so hard to get doc->cluster mapping?

Yuval Feinstein Sun, 22 Apr 2012 00:18:06 -0700

Hi Robert.
I did something similar, but using seq2sparse.
I believe that as long as you preserve the document id when creating the
input file for kmeans,
clusterdump will include the document id in its output.
I suggest that you try this flag when running ./mahout lucene.vector:


 --idField idField    The field in the index containing the index.  If
                      null, then the Lucene internal doc id is used
                      which is prone to error if the underlying index
                      changes
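
Once the ids come through, the doc->cluster mapping can be pulled out of
the dump file with a small script. Below is a rough sketch in Python, not
anything shipped with Mahout. It assumes the text dump has cluster header
lines starting with "VL-<id>{" or "CL-<id>{" and point lines of the form
"<weight>: <name> = [...]". That layout is an assumption and differs
between Mahout versions, so check your own dump and adjust the two
patterns. The function name doc_to_cluster is made up for illustration.

```python
import re
import sys

# Assumed layout of the clusterdump text output (check your Mahout version):
#   VL-12{n=3 c=[...] r=[...]}         <- cluster header, id follows the dash
#       Weight : [props - optional]:  Point:
#       1.0: doc42 = [term:0.5, ...]   <- one point; "doc42" is the vector name
CLUSTER_RE = re.compile(r"^(?:VL|CL)-(\d+)\{")
POINT_RE = re.compile(r"^\s+[\d.]+\s*:\s*(\S+)\s*=\s*\[")


def doc_to_cluster(lines):
    """Return {document id: cluster id} from the lines of a dump file."""
    mapping = {}
    current = None  # id of the cluster whose points are being read
    for line in lines:
        header = CLUSTER_RE.match(line)
        if header:
            current = int(header.group(1))
            continue
        point = POINT_RE.match(line)
        if point and current is not None:
            mapping[point.group(1)] = current
    return mapping


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one "doc id <TAB> cluster id" pair per line.
    with open(sys.argv[1]) as dump:
        for doc_id, cluster_id in sorted(doc_to_cluster(dump).items()):
            print(f"{doc_id}\t{cluster_id}")
```

Saved as, say, dump2map.py, it would be run as "python dump2map.py dump".
Points only carry a name if the vectors were named when they were
created, so without ids in the input nothing will match.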

Good luck,
Yuval


On Sun, Apr 22, 2012 at 8:13 AM, Paritosh Ranjan <[email protected]> wrote:

> The job is kmeans, and the class is KMeansDriver. The input there is
> supposed to be an org.apache.hadoop.fs.Path pointing to a sequence file,
> but I see a txt file being passed.
>
>
> ./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters clusters -cl
>
>
> On 22-04-2012 04:55, Lance Norskog wrote:
>
>> Nothing should require a local or HDFS path. Which job/class is this?
>>
>> On Fri, Apr 20, 2012 at 3:17 AM, Paritosh Ranjan<[email protected]>
>>  wrote:
>>
>>> I am not sure about this, but I see a txt file ( -i output.txt ) as the
>>> input to kmeans.  KMeans is supposed to take an HDFS Path as input.
>>> Ignore this if it is an HDFS path.
>>>
>>>
>>> On 19-04-2012 03:23, Jeff Eastman wrote:
>>>
>>>> Are you running seq2sparse in there somewhere? It has a -nv option that
>>>> will produce NamedVectors in its vector output. These will pass through
>>>> the clustering and be evident in the clusterdump output.
>>>>
>>>> On 4/18/12 3:08 PM, Robert Stewart wrote:
>>>>
>>>>> I am running kmeans clustering on vectors extracted from a lucene
>>>>> index.
>>>>>
>>>>> What I want as my end result is a mapping of document ID to the cluster
>>>>> for each document.  How can I get that output?  I see many other
>>>>> people also want this, but I don't see enough detail in any solution
>>>>> to help me get it.
>>>>>
>>>>> So far I do this:
>>>>>
>>>>> ./mahout lucene.vector -d ~/clusterdemo/solr/data/index/ -f text --idField id --output output.txt --dictOut dict.txt
>>>>>
>>>>> ./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters clusters -cl
>>>>>
>>>>> ./mahout clusterdump --dictionary dict.txt --seqFileDir kmeans/clusters-10-final --dictionaryType text --pointsDir kmeans/clusteredPoints --output dump
>>>>>
>>>>> But what I see inside the "dump" file does not contain any mapping from
>>>>> document ID to cluster.  How can I get that?  It should not be this
>>>>> hard to get the most obvious/useful output IMO ;)
>>>>>
>>>>> Thanks
>>>>> Bob
>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
>
