Re: Cluster Evaluation 0.8 style

Pat Ferrel Fri, 13 Jul 2012 16:16:32 -0700

The user list? Seems like JIRA would be a better place to discuss what files I need to send but OK.


From the inputs to the ClusterEvaluator class I'll send:


1. conf.set(RepresentativePointsDriver.DISTANCE_MEASURE_KEY,
   dm.getClass().getName());
   ---> org.apache.mahout.common.distance.CosineDistanceMeasure
   I guess you can just make a note of this

2. conf.set(RepresentativePointsDriver.STATE_IN_KEY,
   "tmp/representative/representativePoints-" + numIters);
   ---> representativePoints-5/*
   Here 5 is the maxiter value used internally in clusterdump

3. ClusterEvaluator ce = new ClusterEvaluator(conf, finalClusters);
   ---> clusters-27-final/*
   The final clusters dir of the k = 500 run.

I can't upload more than 10M to JIRA and this is 22M so here is a WebDAV URL once again:

http://cloud.occamsmachete.com/public.php?service=files&token=ceae2302d5ef6a55737b5e48aaafe45a3eddc389&file=/cluster-eval.tar.gz

I hope I got it right this time. I don't think there is a ClusterEvaluator driver so I'll throw something together to double check it myself.


Thanks,
Pat

On 7/13/12 1:40 PM, Jeff Eastman wrote:

The rep-points tar you sent doesn't look right. I was expecting a directory of representativePoints-i where i is the number of iterations you used to run the RepresentativePointsDriver. Each iteration will add a single point to the evolving list of representative points for each cluster.
And, next time you send clusters, please don't send the clusteredPoints. All I need is the clusters-n-final directory and the directory with the last representativePoints parts.
Finally, can we please do this on the list so it is searchable by others? You can also upload the relevant files to the JIRA so we know what we are dealing with.
Jeff
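
The iterative idea described above (one additional representative point per cluster per iteration) can be illustrated with a small self-contained sketch. This is not Mahout's RepresentativePointsDriver code: the class name, the method names, and the "farthest from the points already chosen" selection rule are assumptions made for illustration, and the cosine distance is written out by hand rather than taken from CosineDistanceMeasure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Mahout implementation.
public class RepresentativePointsSketch {

    // Cosine distance: 1 - cos(angle between a and b).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // convention chosen for this sketch when a vector is all zeros
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Starts from the cluster center and adds one point per iteration:
    // the clustered point whose nearest existing representative is farthest away.
    static List<double[]> select(double[] center, List<double[]> points, int numIters) {
        List<double[]> reps = new ArrayList<>();
        reps.add(center);
        for (int iter = 0; iter < numIters; iter++) {
            double[] best = null;
            double bestDistance = -1.0;
            for (double[] p : points) {
                double nearest = Double.MAX_VALUE;
                for (double[] r : reps) {
                    nearest = Math.min(nearest, cosineDistance(p, r));
                }
                if (nearest > bestDistance) {
                    bestDistance = nearest;
                    best = p;
                }
            }
            if (best == null) {
                break; // no clustered points to choose from
            }
            reps.add(best);
        }
        return reps;
    }
}
```

With numIters = 5 this returns the center plus five points for one cluster, which loosely mirrors the representativePoints-5 naming mentioned in this thread.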


On 7/13/12 3:58 PM, Pat Ferrel wrote:
OK but I can't find it. It doesn't seem to be listed on the "mahout" CL help. Maybe there's some way to tell the script to execute an arbitrary driver?
Anyway I just wrote a few lines to execute it and sent you a link to the output.
On 7/13/12 12:40 PM, Jeff Eastman wrote:
Sure there is.

On 7/13/12 12:36 PM, Pat Ferrel wrote:
So there is no command line way to run RepresentativePointsDriver? I'll have to hack up something, might be more than a minute...
On 7/13/12 9:06 AM, Pat Ferrel wrote:
OK, didn't know there was a RepresentativePointsDriver. Give me a few minutes.
On 7/13/12 9:04 AM, Jeff Eastman wrote:
Hi Pat,
You will need to run the RepresentativePointsDriver to extract a set of representative points for your clusters. It expects a -i input directory full of clusters (your final directory), a -cp directory full of clustered points, an -o output directory for the representative points, a distance measure, number of iterations, etc.
The cluster dumper does this for you but it is not done by the respective clustering algorithms.
With this data we can run the various evaluators on a consistent and much smaller set of points to debug them further.
Jeff


On 7/11/12 4:43 PM, Pat Ferrel wrote:
D'oh... True that.
This has the final cluster part and the clusteredPoints dir. Are "representative points" taken from clusteredPoints? Anyway let me know if this is not what you need.
https://issues.apache.org/jira/browse/MAHOUT-1045
clusters: 500
CDbw Inter-Cluster Density: 0
CDbw Intra-Cluster Density: 1050.07236806084
CDbw Separation: 187792.321370176
CDbw Validity Index: 1.97E+08
Inter-cluster Density: 0.928988162001239
Intra-cluster Density: NaN
http://cloud.occamsmachete.com/public.php?service=files&token=5c527cbef78c26ea8c729a3b07f45de87011cb16&file=/4000-clusters-eval.tar.gz
On 7/11/12 1:17 PM, Jeff Eastman wrote:
It would be more useful for debugging if you could provide the result clusters and a set of representative points for each. These are more likely to be tractable in terms of debugging than the entire 8G dataset.
On 7/11/12 3:40 PM, Pat Ferrel wrote:
As I've said before this issue is still a problem.
https://issues.apache.org/jira/browse/MAHOUT-1020?focusedCommentId=13409696#comment-13409696
This should be reopened and I sent you a link to get my data (only 8G, good luck!)
My confusion with the per cluster density measure is because in 0.8 an output file is required for clusterdump but the per cluster density measure is not written to it. It's in the INFO output to STDOUT. When I run a bunch of these the STDOUT is lost so I'll have to modify my scripts or update my KFinder code. I'd vote to include it in the output file in the future.
The only problem I've seen with the per cluster Intra-cluster density is that I get a lot of pruned clusters sometimes and the Intra-Cluster Density is not calculated for them. I think we've discussed this in the past.
12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: Intra-Cluster Density[766] = 0.6243875150474454
I really would like to get this stuff working and am willing to provide whatever help you need if you are in a position to work on it. I have 0.8-SNAPSHOT building and, although I am inexperienced at debugging in this kind of large data situation, I am willing to learn. If you'd like me to try something out just point me in the right direction.
I'm also happy to test Ted's inter-cluster stuff too.
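
Since the per-cluster density currently appears only in the INFO output, one workaround is to capture STDOUT to a file and pull the values back out. Below is a minimal sketch of that, assuming the log lines look like the one quoted in this message. The class and method names are hypothetical and not part of Mahout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout.
public class DensityLogParser {

    // Matches e.g. "Intra-Cluster Density[766] = 0.6243875150474454"
    private static final Pattern DENSITY_LINE =
        Pattern.compile("Intra-Cluster Density\\[(\\d+)\\]\\s*=\\s*(\\S+)");

    // Returns cluster id -> density, in the order the lines were logged.
    public static Map<Integer, Double> parse(List<String> lines) {
        Map<Integer, Double> densities = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = DENSITY_LINE.matcher(line);
            if (m.find()) {
                int clusterId = Integer.parseInt(m.group(1));
                // Double.parseDouble also accepts "NaN", so such values are kept, not dropped.
                densities.put(clusterId, Double.parseDouble(m.group(2)));
            }
        }
        return densities;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: "
                + "Intra-Cluster Density[766] = 0.6243875150474454");
        System.out.println(parse(sample));
    }
}
```

Clusters that were pruned produce no such line, so they are simply absent from the returned map.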


On 7/11/12 11:46 AM, Jeff Eastman wrote:
The ClusterEvaluator has methods for both inter-cluster density and intra-cluster density. The former computes the density using the cluster centers, while the latter uses a set of representative points extracted from the clustered points. This reduces the computational overhead of calculating a density from all of the points from each cluster.
The unit test uses synthetic data and produces reasonable looking results afaict. Have you had negative experiences with that?
On 7/11/12 1:21 PM, Pat Ferrel wrote:
...
It was my understanding that the ClusterEvaluator included an attempt to provide this measure with intra-cluster density per cluster though it looks like that output has been removed?
