Hi,

I am trying to cluster text with Canopy and K-Means. This is what I have, and it 
works. But I'm curious whether I should instead run K-Means with Tanimoto and 
Canopy with Euclidean. Which distance measure is K-Means using in my setup? And 
why has the distance measure parameter been removed from KMeansDriver's run method?

        //Generate input clusters for K-means (instead of using random K)
        CanopyDriver.run(conf,
                         TFIDF_VECTORS_PATH,
                         OUTPUT_PATH,
                         new TanimotoDistanceMeasure(),
                         t1,
                         t2,
                         runClusteringFalse,
                         clusterClassificationThreshold,
                         runSequential);
        
        //Generate K-Means clusters
        KMeansDriver.run(conf, 
                         TFIDF_VECTORS_PATH,
                         new Path(OUTPUT_PATH,"clusters-0-final"),
                         KMEANS_OUTPUT_PATH,
                         convergenceDelta,
                         maxIterations,
                         runClustering,
                         clusterClassificationThreshold,
                         runSequential);

I'm wondering about this because I read that Canopy works well with a fast 
distance measure, so I was thinking of using Euclidean for Canopy and Tanimoto 
for K-Means. This is probably totally wrong, but if someone could explain it, 
that would be great.
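
For reference, here is a minimal sketch of the two measures as I understand them, in plain Java with no Mahout dependency. The class and method names (DistanceSketch, euclidean, tanimoto) are made up for illustration, the vectors are assumed to be dense and of equal length, and this is not Mahout's actual implementation:

```java
public class DistanceSketch {

    // Euclidean distance: square root of the summed squared differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b).
    static double tanimoto(double[] a, double[] b) {
        double dot = 0.0;
        double aa = 0.0;
        double bb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            aa += a[i] * a[i];
            bb += b[i] * b[i];
        }
        double denom = aa + bb - dot;
        // Assumption: two all-zero vectors are treated as identical.
        return denom > 0.0 ? 1.0 - dot / denom : 0.0;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for two TF-IDF documents (hypothetical data).
        double[] a = {1, 0, 1};
        double[] b = {1, 1, 0};
        System.out.println("Euclidean: " + euclidean(a, b));
        System.out.println("Tanimoto:  " + tanimoto(a, b));
    }
}
```

Both are a single pass over the vector components here, so at least in this sketch I don't see an obvious cost difference between them.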

Thank you!

Best regards,
Nicklas
