The user list? Seems like JIRA would be a better place to discuss what files I need to send but OK.

From the inputs to the ClusterEvaluator class I'll send:

1. conf.set(RepresentativePointsDriver.DISTANCE_MEASURE_KEY,
   dm.getClass().getName());
   ---> org.apache.mahout.common.distance.CosineDistanceMeasure
   I guess you can just make a note of this

2. conf.set(RepresentativePointsDriver.STATE_IN_KEY,
   "tmp/representative/representativePoints-" + numIters);
   ---> representativePoints-5/*
   Here 5 is the maxiter value used internally in clusterdump

3. ClusterEvaluator ce = new ClusterEvaluator(conf, finalClusters);
   ---> clusters-27-final/*
   The final clusters dir of the k = 500 run.

I can't upload more than 10M to JIRA and this is 22M so here is a webdav URL once again:
http://cloud.occamsmachete.com/public.php?service=files&token=ceae2302d5ef6a55737b5e48aaafe45a3eddc389&file=/cluster-eval.tar.gz

I hope I got it right this time. I don't think there is a cluster evaluator driver so I'll throw something together to double check it myself.

Thanks,
Pat

On 7/13/12 1:40 PM, Jeff Eastman wrote:
The rep-points tar you sent doesn't look right. I was expecting a directory of representativePoints-i where i is the number of iterations you used to run the RepresentativePointsDriver. Each iteration will add a single point to the evolving list of representative points for each cluster.
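
For anyone reading this in the archive, the "one point per iteration" behaviour can be sketched roughly as below. This is only an illustration, not the RepresentativePointsDriver code: the class and method names are hypothetical, plain double[] arrays stand in for Mahout vectors, and both the seeding with the cluster center and the "farthest from the current representative points" selection rule are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and the selection rule are assumptions.
final class RepPointsSketch {

  // Cosine distance, matching the CosineDistanceMeasure used in this thread.
  static double cosineDistance(double[] a, double[] b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // One iteration for one cluster: append the clustered point whose summed
  // distance to the current representative points is largest (assumed rule).
  static void addOnePoint(List<double[]> repPoints, List<double[]> clusteredPoints) {
    double[] best = null;
    double bestTotal = -1.0;
    for (double[] p : clusteredPoints) {
      double total = 0.0;
      for (double[] r : repPoints) {
        total += cosineDistance(r, p);
      }
      if (total > bestTotal) {
        bestTotal = total;
        best = p;
      }
    }
    if (best != null) {
      repPoints.add(best);
    }
  }

  // Seed with the cluster center (assumption), then run numIterations rounds,
  // so the returned list holds 1 + numIterations points.
  static List<double[]> representativePoints(
      double[] center, List<double[]> clusteredPoints, int numIterations) {
    List<double[]> reps = new ArrayList<>();
    reps.add(center);
    for (int i = 0; i < numIterations; i++) {
      addOnePoint(reps, clusteredPoints);
    }
    return reps;
  }
}
```

Under these assumptions the list grows by exactly one point per cluster per iteration, which is why a representativePoints-i directory is expected for each iteration i.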

And, next time you send clusters, please don't send the clusteredPoints. All I need is the clusters-n-final directory and the directory with the last representativePoints parts.

Finally, can we please do this on the list so it is searchable by others? You can also upload the relevant files to the JIRA so we know what we are dealing with.

Jeff


On 7/13/12 3:58 PM, Pat Ferrel wrote:
OK but I can't find it. It doesn't seem to be listed on the "mahout" CL help. Maybe there's some way to tell the script to execute an arbitrary driver?

Anyway I just wrote a few lines to execute it and sent you a link to the output.

On 7/13/12 12:40 PM, Jeff Eastman wrote:
Sure there is.

On 7/13/12 12:36 PM, Pat Ferrel wrote:
So there is no command line way to run RepresentativePointsDriver? I'll have to hack up something, might be more than a minute...

On 7/13/12 9:06 AM, Pat Ferrel wrote:
OK, didn't know there was a RepresentativePointsDriver. Give me a few minutes.

On 7/13/12 9:04 AM, Jeff Eastman wrote:
Hi Pat,

You will need to run the RepresentativePointsDriver to extract a set of representative points for your clusters. It expects a -i input directory full of clusters (your final directory), a -cp directory full of clustered points, an -o output directory for the representative points, a distance measure, number of iterations, etc.

The cluster dumper does this for you but it is not done by the respective clustering algorithms.

With this data we can run the various evaluators on a consistent and much smaller set of points to debug them further.

Jeff


On 7/11/12 4:43 PM, Pat Ferrel wrote:
D'oh... True that.

This has the final cluster part and the clusteredPoints dir. Are "representative points" taken from clusteredPoints? Anyway let me know if this is not what you need.

https://issues.apache.org/jira/browse/MAHOUT-1045
clusters                      500
CDbw Inter-Cluster Density    0
CDbw Intra-Cluster Density    1050.07236806084
CDbw Separation               187792.321370176
CDbw Validity Index           1.97E+08
Inter-cluster Density         0.928988162001239
Intra-cluster Density         NaN

http://cloud.occamsmachete.com/public.php?service=files&token=5c527cbef78c26ea8c729a3b07f45de87011cb16&file=/4000-clusters-eval.tar.gz



On 7/11/12 1:17 PM, Jeff Eastman wrote:
It would be more useful for debugging if you could provide the result clusters and a set of representative points for each. These are more likely to be tractable in terms of debugging than the entire 8G dataset.


On 7/11/12 3:40 PM, Pat Ferrel wrote:
As I've said before this issue is still a problem.
https://issues.apache.org/jira/browse/MAHOUT-1020?focusedCommentId=13409696#comment-13409696 This should be reopened and I sent you a link to get my data (only 8G good luck!)

My confusion with the per cluster density measure is because in 0.8 an output file is required for clusterdump but the per cluster density measure is not written to it. It's in the INFO output to STDOUT. When I run a bunch of these the STDOUT is lost so I'll have to modify my scripts or update my KFinder code. I'd vote to include it in the output file in the future.

The only problem I've seen with the per cluster Intra-cluster density is that I get a lot of pruned clusters sometimes and the Intra-Cluster Density is not calculated for them. I think we've discussed this in the past.

12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: Intra-Cluster Density[766] = 0.6243875150474454
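
Until the value is written to the output file, one workaround is to capture STDOUT to a file and pull the per cluster values out of it. A minimal sketch, assuming every line of interest looks like the one above; the class name DensityLogParser is made up for illustration and is not part of Mahout:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts "Intra-Cluster Density[id] = value" entries
// from captured log lines. The line format is taken from the sample above.
final class DensityLogParser {

  private static final Pattern LINE = Pattern.compile(
      "Intra-Cluster Density\\[(\\d+)\\] = (NaN|[-+]?[0-9.]+(?:[eE][-+]?[0-9]+)?)");

  // Returns cluster id -> density in the order the lines were seen.
  // Lines that do not match the pattern are ignored.
  static Map<Integer, Double> parse(Iterable<String> lines) {
    Map<Integer, Double> densities = new LinkedHashMap<>();
    for (String line : lines) {
      Matcher m = LINE.matcher(line);
      if (m.find()) {
        // Double.parseDouble accepts the literal "NaN" as well as numbers.
        densities.put(Integer.parseInt(m.group(1)), Double.parseDouble(m.group(2)));
      }
    }
    return densities;
  }
}
```

Feeding it the lines of a saved log gives a map keyed by cluster id, which can then be written next to the clusterdump output file.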

I really would like to get this stuff working and am willing to provide whatever help you need if you are in a position to work on it. I have 0.8-SNAPSHOT building. I'm inexperienced at debugging in this kind of large data situation but willing to learn. If you'd like me to try something out just point me in the right direction.

I'm also happy to test Ted's inter-cluster stuff.


On 7/11/12 11:46 AM, Jeff Eastman wrote:
The ClusterEvaluator has methods for both inter-cluster density and intra-cluster density. The former computes the density using the cluster centers, while the latter uses a set of representative points extracted from the clustered points. This reduces the computational overhead of calculating a density from all of the points from each cluster.
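
To make the distinction concrete, here is a rough sketch of the two calculations. It is not the ClusterEvaluator source: the class name is hypothetical, double[] arrays stand in for Mahout vectors, and the (mean - min) / (max - min) normalization over pairwise distances is an assumption made for illustration.

```java
import java.util.List;

// Illustrative sketch only; the normalization formula is an assumption.
final class DensitySketch {

  static double cosineDistance(double[] a, double[] b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // (mean - min) / (max - min) over all pairwise distances.
  // Returns NaN when there are no pairs or when max == min (0/0).
  static double normalizedDensity(List<double[]> points) {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    int count = 0;
    for (int i = 0; i < points.size(); i++) {
      for (int j = i + 1; j < points.size(); j++) {
        double d = cosineDistance(points.get(i), points.get(j));
        min = Math.min(min, d);
        max = Math.max(max, d);
        sum += d;
        count++;
      }
    }
    if (count == 0) {
      return Double.NaN;
    }
    return (sum / count - min) / (max - min);
  }

  // Inter-cluster: one point per cluster, the cluster center.
  static double interClusterDensity(List<double[]> centers) {
    return normalizedDensity(centers);
  }

  // Intra-cluster: density of each cluster's representative points,
  // averaged over clusters. A single NaN cluster makes the average NaN.
  static double intraClusterDensity(List<List<double[]>> repPointsPerCluster) {
    double total = 0.0;
    for (List<double[]> reps : repPointsPerCluster) {
      total += normalizedDensity(reps);
    }
    return total / repPointsPerCluster.size();
  }
}
```

With this assumed formula the inter-cluster cost grows with the square of the number of clusters, while the intra-cluster cost depends only on the handful of representative points per cluster rather than on every clustered point.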

The unit test uses synthetic data and produces reasonable looking results afaict. Have you had negative experiences with that?

On 7/11/12 1:21 PM, Pat Ferrel wrote:
...

It was my understanding that the ClusterEvaluator included an attempt to provide this measure with intra-cluster density per cluster though it looks like that output has been removed?