Thanks, Jeff. I have one more question: what is the structure of the contents of the part-m-* files in clusteredPoints? My interpretation is that each record is a key-value pair in which the key is the ID of the cluster the vector belongs to and the value is the point vector.
I want to write a different version of the ClusterDumper code in which a new file is created for each cluster, and that file contains the points belonging to that cluster. The reason is that the existing ClusterDumper code is unable to handle large datasets. Is my interpretation of the part-m-* files correct?

On Wed, Nov 9, 2011 at 11:27 PM, Jeff Eastman <[email protected]> wrote:

> See inline,
> Jeff
>
> -----Original Message-----
> From: gaurav redkar [mailto:[email protected]]
> Sent: Wednesday, November 09, 2011 4:09 AM
> To: [email protected]
> Subject: meanshift clustering
>
> Hi.. I am unable to identify where is the clusterPoints() function in the
> MeanShiftCanopyClusterer.java file being called during the execution of
> Meanshift job.
>
> [jeff] That method is not called except by a unit test
> TestMeanShift.testClustererReferenceImplementation.
>
> What i need to know is where are the files in clusteredPoints n clusters-*
> directory being written when we run the job on hadoop.
>
> [jeff] Those directories will be created within the --output directory
> which you specify for your job
>
> buildclustersMR() creates the clusters-* directory for each iteration but i
> am unable to locate the code which actually writes to d part-r-* files .
>
> [jeff] The code which writes the part-r-* files is Hadoop code which is
> called within MeanShiftCanopyReducer.reduce (line 55)
>
> Any suggestions..??
>
> Thanks
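
For reference, here is a rough sketch of the per-cluster dumper described at the top of this message. It depends on my interpretation being right, i.e. that each record is a (clusterID, point vector) pair. The class name PerClusterDumper, the cluster-<id>.txt file naming, and the use of plain strings for the points are all placeholders of mine and are not Mahout code. The sketch uses only the Java standard library; in a real job the records would presumably come from a Hadoop SequenceFile.Reader over the part-m-* files rather than from an in-memory iterator.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): one output file per cluster. */
final class PerClusterDumper {

    /**
     * Streams (clusterId, point) records into outDir/cluster-<id>.txt.
     * Each record is written as soon as it is read, so memory use does not
     * grow with the number of points. Returns the point count per cluster.
     *
     * Assumption: in a real job the iterator would wrap a loop over
     * SequenceFile.Reader.next(key, value) on each part-m-* file.
     */
    static Map<Integer, Integer> dump(Iterator<Map.Entry<Integer, String>> records,
                                      Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Map<Integer, BufferedWriter> writers = new HashMap<>();
        Map<Integer, Integer> counts = new HashMap<>();
        try {
            while (records.hasNext()) {
                Map.Entry<Integer, String> rec = records.next();
                int clusterId = rec.getKey();
                // Open the cluster's file the first time its ID is seen.
                BufferedWriter w = writers.get(clusterId);
                if (w == null) {
                    Path file = outDir.resolve("cluster-" + clusterId + ".txt");
                    w = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                    writers.put(clusterId, w);
                }
                w.write(rec.getValue());
                w.newLine();
                counts.merge(clusterId, 1, Integer::sum);
            }
        } finally {
            // Flush and close every per-cluster file.
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
        return counts;
    }
}
```

One caveat with this approach: it keeps one open file per cluster, so a run that produces a very large number of clusters could hit the operating system's open-file limit and would need to close and reopen writers in append mode instead.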
