Close. The "value" is actually a WeightedVectorWritable which includes the 
probability of the vector being a member of the given cluster "key". For 
MeanShift, Canopy and K-Means this is always 1.0 since these are 
maximum-likelihood clusterers. For FuzzyK and Dirichlet, the probability will 
be fractional (with an optional filter threshold, default=0, to cull the 
outliers), or you can select EmitMostLikely to mimic the other clusterers.

Also, check out MAHOUT-843 where we are working on a postprocessor (for 
hierarchical clustering) which takes clusteredPoints and sorts them into 
individual directories for the next clustering phase as you describe. Feel free 
to contribute your ideas to this issue.
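
For the one-file-per-cluster dumper, here is a minimal sketch of the split in
plain Java, with no Hadoop or Mahout dependencies. Everything in it is an
assumption made for illustration: ClusteredPoint is a hypothetical stand-in for
one clusteredPoints record (the cluster ID key plus the weight and vector
carried by the WeightedVectorWritable value), and the file naming, the
tab-separated line format, and the rule of skipping weights below the threshold
are not taken from Mahout. A real version would read the records from the
part-m-* SequenceFiles instead of an in-memory list.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class PerClusterDump {

    /** Hypothetical stand-in for one record: key = cluster ID, value = (weight, vector). */
    static final class ClusteredPoint {
        final int clusterId;
        final double weight;   // membership probability; 1.0 for MeanShift, Canopy, K-Means
        final double[] vector;

        ClusteredPoint(int clusterId, double weight, double[] vector) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.vector = vector;
        }
    }

    /**
     * Streams the records once and appends each point to the file of its
     * cluster, so the full data set is never held in memory.
     * Assumption: points whose weight is below the threshold are skipped.
     */
    static void dump(Iterable<ClusteredPoint> records, Path outDir, double threshold)
            throws IOException {
        Files.createDirectories(outDir);
        for (ClusteredPoint p : records) {
            if (p.weight < threshold) {
                continue; // cull low-probability members (outliers)
            }
            Path file = outDir.resolve("cluster-" + p.clusterId + ".txt");
            String line = p.weight + "\t" + Arrays.toString(p.vector) + "\n";
            Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

Reopening the file for every record keeps the sketch short but is slow; with
many points you would probably keep a bounded cache of open writers instead.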

-----Original Message-----
From: gaurav redkar [mailto:[email protected]] 
Sent: Thursday, November 10, 2011 10:01 PM
To: [email protected]
Subject: Re: meanshift clustering

Thanks Jeff. I have one more question. May I know the structure of the contents
of the part-m-* files in clusteredPoints? My interpretation is that each
record is a key-value pair where "key" is the clusterID to which the
vector belongs and "value" is the point vector.

I want to write a different version of the ClusterDumper code where a new file
is created for each cluster and that file contains the points belonging to
that cluster, the reason being that the existing ClusterDumper code is unable
to handle large datasets. Is my interpretation of the part-m-* files correct?

On Wed, Nov 9, 2011 at 11:27 PM, Jeff Eastman <[email protected]> wrote:

> See inline,
> Jeff
>
> -----Original Message-----
> From: gaurav redkar [mailto:[email protected]]
> Sent: Wednesday, November 09, 2011 4:09 AM
> To: [email protected]
> Subject: meanshift clustering
>
> Hi. I am unable to identify where the clusterPoints() function in the
> MeanShiftCanopyClusterer.java file is being called during the execution of
> the MeanShift job.
>
>
> [jeff] That method is not called except by a unit test
> TestMeanShift.testClustererReferenceImplementation.
>
> What I need to know is where the files in the clusteredPoints and clusters-*
> directories are being written when we run the job on Hadoop.
>
>
> [jeff] Those directories will be created within the --output directory
> which you specify for your job.
>
> buildclustersMR() creates the clusters-* directory for each iteration, but I
> am unable to locate the code which actually writes to the part-r-* files.
>
>
> [jeff] The code which writes the part-r-* files is Hadoop code which is
> called within MeanShiftCanopyReducer.reduce (line 55)
>
>
> Any suggestions..??
>
>
> Thanks
>
