Close. The "value" is actually a WeightedVectorWritable, which includes the probability of the vector being a member of the cluster identified by the "key". For MeanShift, Canopy and K-Means this is always 1.0, since these are maximum-likelihood clusterers. For FuzzyK and Dirichlet the probability will be fractional (with an optional filter threshold, default 0, to cull the outliers), or you can select EmitMostLikely to mimic the other clusterers.
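To make the record layout concrete, here is a rough sketch in plain Java of the per-cluster split being asked about. It is only an illustration under assumptions: ClusteredPoint and ClusterSplitter are hypothetical names, not Mahout classes. ClusteredPoint stands in for one (clusterId key, WeightedVectorWritable value) record, and reading the actual part-m-* files is left out. The threshold parameter is likewise just an illustrative way of culling low-probability points.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ClusterSplitter {

    /** Hypothetical stand-in for one clusteredPoints record: key plus weighted vector. */
    static final class ClusteredPoint {
        final int clusterId;   // the record "key"
        final double weight;   // membership probability (1.0 for MeanShift, Canopy, K-Means)
        final double[] point;  // the point vector

        ClusteredPoint(int clusterId, double weight, double[] point) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.point = point;
        }
    }

    /**
     * Streams the records once and appends each one to outDir/cluster-<id>.txt,
     * skipping records whose weight is below the threshold.
     * Returns the number of points written per cluster.
     */
    static Map<Integer, Integer> split(Iterable<ClusteredPoint> records,
                                       Path outDir, double threshold) {
        Map<Integer, BufferedWriter> writers = new HashMap<>();
        Map<Integer, Integer> counts = new HashMap<>();
        try {
            Files.createDirectories(outDir);
            for (ClusteredPoint r : records) {
                if (r.weight < threshold) {
                    continue; // cull low-probability points
                }
                BufferedWriter w = writers.get(r.clusterId);
                if (w == null) {
                    // First point seen for this cluster: open its file.
                    w = Files.newBufferedWriter(
                            outDir.resolve("cluster-" + r.clusterId + ".txt"));
                    writers.put(r.clusterId, w);
                }
                w.write(r.weight + "\t" + Arrays.toString(r.point));
                w.newLine();
                counts.merge(r.clusterId, 1, Integer::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Close (and flush) every writer; a close failure is ignored in this sketch.
            for (BufferedWriter w : writers.values()) {
                try {
                    w.close();
                } catch (IOException e) {
                    // ignored
                }
            }
        }
        return counts;
    }
}
```

Only one record is held in memory at a time, so the input size does not matter, but the sketch keeps one open file per cluster, which may not suit a very large number of clusters.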
Also, check out MAHOUT-843, where we are working on a postprocessor (for hierarchical clustering) which takes clusteredPoints and sorts them into individual directories for the next clustering phase, as you describe. Feel free to contribute your ideas to this issue.

-----Original Message-----
From: gaurav redkar [mailto:[email protected]]
Sent: Thursday, November 10, 2011 10:01 PM
To: [email protected]
Subject: Re: meanshift clustering

Thanks Jeff. I have one more question. May I know the structure of the contents of the part-m-* files in clusteredPoints? My interpretation is that each record is a key-value pair where "key" is the clusterID to which the vector belongs and "value" is the point vector. I want to write a different version of the ClusterDumper code where a new file is created for each cluster and that file contains the points belonging to that cluster, the reason being that the existing ClusterDumper code is unable to handle large datasets. Is my interpretation of the part-m-* files correct?

On Wed, Nov 9, 2011 at 11:27 PM, Jeff Eastman <[email protected]> wrote:

> See inline,
> Jeff
>
> -----Original Message-----
> From: gaurav redkar [mailto:[email protected]]
> Sent: Wednesday, November 09, 2011 4:09 AM
> To: [email protected]
> Subject: meanshift clustering
>
> Hi.. I am unable to identify where the clusterPoints() function in the
> MeanShiftCanopyClusterer.java file is being called during the execution of
> the MeanShift job.
>
> [jeff] That method is not called except by a unit test,
> TestMeanShift.testClustererReferenceImplementation.
>
> What I need to know is where the files in the clusteredPoints and clusters-*
> directories are being written when we run the job on Hadoop.
>
> [jeff] Those directories will be created within the --output directory
> which you specify for your job.
>
> buildclustersMR() creates the clusters-* directory for each iteration, but I
> am unable to locate the code which actually writes to the part-r-* files.
>
> [jeff] The code which writes the part-r-* files is Hadoop code which is
> called within MeanShiftCanopyReducer.reduce (line 55).
>
> Any suggestions?
>
> Thanks
