+user@
-------- Original Message --------
Subject: Re: Cluster Evaluation 0.8 style
Date: Wed, 11 Jul 2012 16:17:29 -0400
From: Jeff Eastman <[email protected]>
To: Pat Ferrel <[email protected]>
It would be more useful for debugging if you could provide the result
clusters and a set of representative points for each. These are more
likely to be tractable in terms of debugging than the entire 8G dataset.
On 7/11/12 3:40 PM, Pat Ferrel wrote:
As I've said before, this issue is still a problem.
https://issues.apache.org/jira/browse/MAHOUT-1020?focusedCommentId=13409696#comment-13409696
This should be reopened, and I sent you a link to get my data (only 8G,
good luck!)
My confusion with the per-cluster density measure arises because in 0.8
an output file is required for clusterdump, but the per-cluster density
measure is not written to it. It's in the INFO output to STDOUT. When
I run a bunch of these the STDOUT is lost, so I'll have to modify my
scripts or update my KFinder code. I'd vote to include it in the
output file in the future.
The only problem I've seen with the per-cluster intra-cluster density
is that I sometimes get a lot of pruned clusters, and the intra-cluster
density is not calculated for them. I think we've discussed this in
the past.
12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: Intra-Cluster Density[766] = 0.6243875150474454
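Until the measure is written to the output file, lines like the one above can be scraped from captured STDOUT. The following is a minimal sketch, not part of Mahout: the function name `parse_intra_cluster_densities` is hypothetical, and the pattern assumes every per-cluster line has the same `Intra-Cluster Density[<id>] = <value>` shape as the single sample in this message.

```python
import re

# Matches "Intra-Cluster Density[766] = 0.6243875150474454".
# \s+ between the two words tolerates a line that was wrapped by a mail client.
# Assumption: the value is a plain decimal, or NaN/Infinity as Java prints them.
DENSITY_LINE = re.compile(
    r"Intra-Cluster\s+Density\[(\d+)\]\s*=\s*"
    r"([-+]?(?:\d+\.?\d*(?:[eE][-+]?\d+)?|NaN|Infinity))"
)


def parse_intra_cluster_densities(log_text):
    """Return {cluster_id: density} for every matching line in log_text."""
    return {
        int(cluster_id): float(value)
        for cluster_id, value in DENSITY_LINE.findall(log_text)
    }


if __name__ == "__main__":
    sample = (
        "12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: "
        "Intra-Cluster Density[766] = 0.6243875150474454\n"
    )
    for cluster_id, density in parse_intra_cluster_densities(sample).items():
        print(cluster_id, density)
```

Redirecting each run's STDOUT to its own file and passing the file contents to this function would keep the per-cluster values alongside the clusterdump output.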
I really would like to get this stuff working and am willing to
provide whatever help you need if you are in a position to work on it.
I have 0.8-SNAPSHOT building. I'm inexperienced at debugging in this
kind of large-data situation but willing to learn. If you'd like me to
try something out, just point me in the right direction.
I'm also happy to test Ted's inter-cluster stuff.
On 7/11/12 11:46 AM, Jeff Eastman wrote:
The ClusterEvaluator has methods for both inter-cluster density and
intra-cluster density. The former computes the density using the
cluster centers, while the latter uses a set of representative points
extracted from the clustered points. This reduces the computational
overhead of calculating a density from all of the points from each
cluster.
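The distinction above can be illustrated with a small sketch. This is not Mahout's code: the function names are hypothetical, Euclidean distance is assumed, and the normalization (mean pairwise distance rescaled between the minimum and maximum pairwise distance) is an assumption about how such a density might be defined, so ClusterEvaluator's actual computation may differ.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_density(points, distance=euclidean):
    """Assumed density: (mean - min) / (max - min) over pairwise distances.

    Returns None when the value is undefined (fewer than two points,
    or all pairwise distances equal).
    """
    dists = [distance(a, b) for a, b in combinations(points, 2)]
    if not dists:
        return None
    lo, hi = min(dists), max(dists)
    if hi == lo:
        return None
    return (sum(dists) / len(dists) - lo) / (hi - lo)


def inter_cluster_density(centers):
    """Density over cluster centers only: {cluster_id: center}."""
    return pairwise_density(list(centers.values()))


def intra_cluster_density(representative_points):
    """Per-cluster density over representative points, plus the average.

    representative_points: {cluster_id: [point, ...]}
    Clusters whose density is undefined are reported as None and
    left out of the average.
    """
    per_cluster = {
        cid: pairwise_density(pts) for cid, pts in representative_points.items()
    }
    defined = [d for d in per_cluster.values() if d is not None]
    average = sum(defined) / len(defined) if defined else None
    return per_cluster, average
```

Working from a handful of representative points per cluster keeps the pairwise computation small, since the number of pairs grows quadratically with the number of points considered.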
The unit test uses synthetic data and produces reasonable looking
results afaict. Have you had negative experiences with that?
On 7/11/12 1:21 PM, Pat Ferrel wrote:
...
It was my understanding that the ClusterEvaluator included an
attempt to provide this measure with intra-cluster density per
cluster though it looks like that output has been removed?