+user@

-------- Original Message --------
Subject:        Re: Cluster Evaluation 0.8 style
Date:   Wed, 11 Jul 2012 16:17:29 -0400
From:   Jeff Eastman <[email protected]>
To:     Pat Ferrel <[email protected]>



It would be more useful for debugging if you could provide the result
clusters and a set of representative points for each. These are more
likely to be tractable in terms of debugging than the entire 8G dataset.


On 7/11/12 3:40 PM, Pat Ferrel wrote:
 As I've said before, this issue is still a problem.
 
https://issues.apache.org/jira/browse/MAHOUT-1020?focusedCommentId=13409696#comment-13409696

 This should be reopened, and I sent you a link to get my data (only 8G,
 good luck!)

 My confusion with the per-cluster density measure is because in 0.8 an
 output file is required for clusterdump, but the per-cluster density
 measure is not written to it. It's in the INFO output to STDOUT. When
 I run a bunch of these the STDOUT is lost, so I'll have to modify my
 scripts or update my KFinder code. I'd vote to include it in the
 output file in the future.

 The only problem I've seen with the per-cluster intra-cluster density
 is that I sometimes get a lot of pruned clusters, and the intra-cluster
 density is not calculated for them. I think we've discussed this in
 the past.

 12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: Intra-Cluster Density[766] = 0.6243875150474454
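
Until the per-cluster value is written to the output file, one workaround is to save STDOUT to a file and pull the densities out of the INFO lines. The following is only a minimal sketch: it assumes the log lines have exactly the format quoted above, and the function name parse_intra_cluster_densities is hypothetical, not part of Mahout or KFinder.

```python
import re

# Matches "Intra-Cluster Density[<cluster id>] = <value>".
# \s+ between the two words tolerates a line wrap such as the one
# introduced by email clients. The format is assumed from the single
# log line quoted in this thread.
DENSITY_RE = re.compile(
    r"Intra-Cluster\s+Density\[(\d+)\]\s*=\s*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_intra_cluster_densities(log_text):
    """Return {cluster_id: density} for every matching line in log_text."""
    return {
        int(match.group(1)): float(match.group(2))
        for match in DENSITY_RE.finditer(log_text)
    }


sample_log = (
    "12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: "
    "Intra-Cluster Density[766] = 0.6243875150474454\n"
)
densities = parse_intra_cluster_densities(sample_log)
print(densities)
```

Clusters with no density line in the log (for example, pruned ones) are simply absent from the returned dictionary, so a script can detect them by comparing against the full list of cluster ids.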

 I really would like to get this stuff working and am willing to
 provide whatever help you need if you are in a position to work on it.
 I have 0.8-SNAPSHOT building. I'm inexperienced at debugging in this
 kind of large-data situation, but I'm willing to learn. If you'd like
 me to try something out, just point me in the right direction.

 I'm also happy to test Ted's inter-cluster stuff.


 On 7/11/12 11:46 AM, Jeff Eastman wrote:
 The ClusterEvaluator has methods for both inter-cluster density and
 intra-cluster density. The former computes the density using the
 cluster centers, while the latter uses a set of representative points
 extracted from the clustered points. This reduces the computational
 overhead of calculating a density from all of the points from each
 cluster.
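
To make the distinction concrete, here is a rough sketch of the two inputs. It is not the ClusterEvaluator code: it uses mean pairwise Euclidean distance as a stand-in measure, and the real normalization in Mahout may differ. All function names and the sample data are made up for illustration.

```python
from itertools import combinations
from math import dist, isnan


def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points.

    Stand-in measure for illustration only; returns NaN when there are
    fewer than two points, since no pair exists.
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        return float("nan")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def inter_cluster_measure(centers):
    """Inter-cluster: computed from the cluster centers only."""
    return mean_pairwise_distance(centers)


def intra_cluster_measure(representative_points):
    """Intra-cluster: computed per cluster from a small set of
    representative points instead of every clustered point."""
    return {
        cluster_id: mean_pairwise_distance(points)
        for cluster_id, points in representative_points.items()
    }


# Hypothetical data: two clusters in 2-D.
centers = [(0.0, 0.0), (3.0, 4.0)]
rep_points = {
    0: [(0.0, 0.0), (0.0, 1.0)],
    1: [(3.0, 4.0), (3.0, 6.0), (3.0, 8.0)],
}

inter = inter_cluster_measure(centers)
intra = intra_cluster_measure(rep_points)
print(inter, intra)
```

The cost saving comes from the pair count: with n points the pairwise work grows as n(n-1)/2, so a handful of representative points per cluster is far cheaper than every point in the cluster.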

 The unit test uses synthetic data and produces reasonable-looking
 results, AFAICT. Have you had negative experiences with that?

 On 7/11/12 1:21 PM, Pat Ferrel wrote:
 ...

 It was my understanding that the ClusterEvaluator included an
 attempt to provide this measure with intra-cluster density per
 cluster though it looks like that output has been removed?







