The user list? Seems like JIRA would be a better place to discuss what
files I need to send but OK.
From the inputs to the ClusterEvaluator class I'll send:
1. conf.set(RepresentativePointsDriver.DISTANCE_MEASURE_KEY, dm.getClass().getName());
   ---> org.apache.mahout.common.distance.CosineDistanceMeasure
   I guess you can just make a note of this.
2. conf.set(RepresentativePointsDriver.STATE_IN_KEY, "tmp/representative/representativePoints-" + numIters);
   ---> representativePoints-5/*
   Here 5 is the maxiter value used internally in clusterdump.
3. ClusterEvaluator ce = new ClusterEvaluator(conf, finalClusters);
   ---> clusters-27-final/*
   The final clusters dir of the k = 500 run.
I can't upload more than 10M to JIRA and this is 22M so here is a webdav
URL once again:
http://cloud.occamsmachete.com/public.php?service=files&token=ceae2302d5ef6a55737b5e48aaafe45a3eddc389&file=/cluster-eval.tar.gz
I hope I got it right this time. I don't think there is a cluster
evaluator driver so I'll throw something together to double check it myself.
Thanks,
Pat
On 7/13/12 1:40 PM, Jeff Eastman wrote:
The rep-points tar you sent doesn't look right. I was expecting a
directory of representativePoints-i where i is the number of
iterations you used to run the RepresentativePointsDriver. Each
iteration will add a single point to the evolving list of
representative points for each cluster.
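
To make the "one point per iteration" idea concrete, here is a minimal, self-contained sketch in plain Java. It is not Mahout's RepresentativePointsDriver. The class and method names are hypothetical, and two things are assumptions on my part: that the list is seeded with the cluster center, and that each iteration adds the clustered point with the largest summed distance to the representatives chosen so far.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and selection rule are assumptions, not Mahout code. */
class RepresentativePointsSketch {

    /** Cosine distance: 1 - cos(angle between a and b). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 1.0; // sketch-only choice: treat a zero vector as unrelated to everything
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    /**
     * Starts from the cluster center (assumed seed) and, on each iteration,
     * appends the point whose summed distance to the current representatives
     * is largest (assumed rule). Each iteration adds at most one point.
     */
    static List<double[]> select(double[] center, List<double[]> points, int iterations) {
        List<double[]> reps = new ArrayList<>();
        reps.add(center);
        for (int iter = 0; iter < iterations; iter++) {
            double[] farthest = null;
            double best = Double.NEGATIVE_INFINITY;
            for (double[] p : points) {
                if (reps.contains(p)) {
                    continue; // arrays compare by identity, so this skips already chosen points
                }
                double total = 0;
                for (double[] r : reps) {
                    total += cosineDistance(p, r);
                }
                if (total > best) {
                    best = total;
                    farthest = p;
                }
            }
            if (farthest == null) {
                break; // no unchosen points left
            }
            reps.add(farthest);
        }
        return reps;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {1, 0.1});
        points.add(new double[] {0, 1});
        points.add(new double[] {1, 1});
        List<double[]> reps = select(new double[] {1, 0}, points, 2);
        System.out.println("representative points (including seed): " + reps.size());
    }
}
```

With this scheme, a directory written after i iterations would hold the seed plus up to i added points per cluster, which is why only the last iteration's output is needed.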
And, next time you send clusters, please don't send the
clusteredPoints. All I need is the clusters-n-final directory and the
directory with the last representativePoints parts.
Finally, can we please do this on the list so it is searchable by
others? You can also upload the relevant files to the JIRA so we know
what we are dealing with.
Jeff
On 7/13/12 3:58 PM, Pat Ferrel wrote:
OK but I can't find it. It doesn't seem to be listed in the "mahout"
command-line help. Maybe there's some way to tell the script to execute
an arbitrary driver?
Anyway I just wrote a few lines to execute it and sent you a link to
the output.
On 7/13/12 12:40 PM, Jeff Eastman wrote:
Sure there is.
On 7/13/12 12:36 PM, Pat Ferrel wrote:
So there is no command line way to run RepresentativePointsDriver?
I'll have to hack up something, might be more than a minute...
On 7/13/12 9:06 AM, Pat Ferrel wrote:
OK, didn't know there was a RepresentativePointsDriver. Give me a
few minutes.
On 7/13/12 9:04 AM, Jeff Eastman wrote:
Hi Pat,
You will need to run the RepresentativePointsDriver to extract a
set of representative points for your clusters. It expects a -i
input directory full of clusters (your final directory), a -cp
directory full of clustered points, an -o output directory for
the representative points, a distance measure, number of
iterations, etc.
The cluster dumper does this for you, but it is not done by the
respective clustering algorithms.
With this data we can run the various evaluators on a consistent
and much smaller set of points to debug them further.
Jeff
On 7/11/12 4:43 PM, Pat Ferrel wrote:
D'oh... True that.
This has the final cluster part and the clusteredPoints dir. Are
"representative points" taken from clusteredPoints? Anyway let
me know if this is not what you need.
https://issues.apache.org/jira/browse/MAHOUT-1045
clusters                      500
CDbw Inter-Cluster Density    0
CDbw Intra-Cluster Density    1050.07236806084
CDbw Separation               187792.321370176
CDbw Validity Index           1.97E+08
Inter-cluster Density         0.928988162001239
Intra-cluster Density         NaN
http://cloud.occamsmachete.com/public.php?service=files&token=5c527cbef78c26ea8c729a3b07f45de87011cb16&file=/4000-clusters-eval.tar.gz
On 7/11/12 1:17 PM, Jeff Eastman wrote:
It would be more useful for debugging if you could provide the
result clusters and a set of representative points for each.
These are more likely to be tractable in terms of debugging
than the entire 8G dataset.
On 7/11/12 3:40 PM, Pat Ferrel wrote:
As I've said before this issue is still a problem.
https://issues.apache.org/jira/browse/MAHOUT-1020?focusedCommentId=13409696#comment-13409696
This should be reopened, and I sent you a link to get my data
(only 8G, good luck!)
My confusion with the per-cluster density measure arose because
in 0.8 an output file is required for clusterdump, but the
per-cluster density measure is not written to it. It's in the INFO
output to STDOUT. When I run a bunch of these the STDOUT is
lost, so I'll have to modify my scripts or update my KFinder
code. I'd vote to include it in the output file in the future.
The only problem I've seen with the per cluster Intra-cluster
density is that I get a lot of pruned clusters sometimes and
the Intra-Cluster Density is not calculated for them. I think
we've discussed this in the past.
12/07/11 12:22:12 INFO evaluation.ClusterEvaluator:
Intra-Cluster Density[766] = 0.6243875150474454
I really would like to get this stuff working and am willing
to provide whatever help you need if you are in a position to
work on it. I have 0.8-SNAPSHOT building. I'm inexperienced at
debugging in this kind of large-data situation, but I'm willing
to learn. If you'd like me to try something out, just point me
in the right direction.
I'm also happy to test Ted's inter-cluster stuff.
On 7/11/12 11:46 AM, Jeff Eastman wrote:
The ClusterEvaluator has methods for both inter-cluster
density and intra-cluster density. The former computes the
density using the cluster centers, while the latter uses a
set of representative points extracted from the clustered
points. This reduces the computational overhead of
calculating a density from all of the points from each cluster.
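
As a rough illustration of the two measures described above, here is a self-contained sketch in plain Java. It is not the ClusterEvaluator source. The class and method names are hypothetical, Euclidean distance stands in for whatever measure is configured, and the normalization (mean pairwise distance rescaled between the minimum and maximum pairwise distance) is an assumption chosen for the example. It also shows how a cluster with fewer than two representative points has no defined value, and how a naive average over clusters then becomes NaN.

```java
import java.util.List;

/** Illustrative sketch only; the formula and names are assumptions, not Mahout code. */
class DensitySketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * (mean - min) / (max - min) over all pairwise distances.
     * Returns NaN when there are no pairs or all distances are equal.
     */
    static double normalizedMeanPairwiseDistance(List<double[]> points) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        int count = 0;
        for (int i = 0; i < points.size(); i++) {
            for (int j = i + 1; j < points.size(); j++) {
                double d = euclidean(points.get(i), points.get(j));
                min = Math.min(min, d);
                max = Math.max(max, d);
                sum += d;
                count++;
            }
        }
        if (count == 0 || max == min) {
            return Double.NaN;
        }
        return (sum / count - min) / (max - min);
    }

    /** Inter-cluster: one value computed from the cluster centers. */
    static double interClusterDensity(List<double[]> centers) {
        return normalizedMeanPairwiseDistance(centers);
    }

    /** Intra-cluster: one value per cluster, computed from its representative points. */
    static double[] intraClusterDensities(List<List<double[]>> repPointsPerCluster) {
        double[] result = new double[repPointsPerCluster.size()];
        for (int c = 0; c < result.length; c++) {
            result[c] = normalizedMeanPairwiseDistance(repPointsPerCluster.get(c));
        }
        return result;
    }

    /** Naive average: a single NaN entry makes the whole result NaN. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        List<double[]> full = List.of(new double[] {0, 0}, new double[] {1, 0}, new double[] {3, 0});
        List<double[]> degenerate = List.of(new double[] {5, 5});
        double[] perCluster = intraClusterDensities(List.of(full, degenerate));
        System.out.println("per-cluster: " + perCluster[0] + ", " + perCluster[1]);
        System.out.println("naive average: " + mean(perCluster));
    }
}
```

Whether the real evaluator skips such clusters or lets the NaN propagate is something to check in the code; the sketch only shows that both behaviours are possible with the same inputs.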
The unit test uses synthetic data and produces reasonable
looking results afaict. Have you had negative experiences
with that?
On 7/11/12 1:21 PM, Pat Ferrel wrote:
...
It was my understanding that the ClusterEvaluator included
an attempt to provide this measure with intra-cluster
density per cluster though it looks like that output has
been removed?