Re: Intermittant Test Failure: testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)

2010-04-30 Thread Jeff Eastman
going to look into reducing the numbers of iterations on the clustering tests which are some of the culprits. On 4/29/10 6:15 PM, Grant Ingersoll wrote: On Apr 29, 2010, at 6:36 PM, Jeff Eastman wrote: right at the end of the 15 min core tests which makes it especially annoying. L

NamedVector Run Amok?

2010-04-27 Thread Jeff Eastman
Hi Sean, I was under the impression that the recently refactored NamedVectors would be just another kind of Vector and that they would not need to show up in method signatures unless there really was a requirement for that explicit type. What I see now in many places in the clustering code is

[jira] Updated: (MAHOUT-236) Cluster Evaluation Tools

2010-04-27 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-236: Attachment: MAHOUT-236.patch Here's a new patch that has initial, probably inco

Re: [jira] Created: (MAHOUT-387) Cosine item similarity implementation

2010-04-27 Thread Jeff Eastman
From Mahout In Action: You may be searching for something like “CosineMeasureSimilarity” in Mahout. You’ve actually already found it but under an unexpected name: PearsonCorrelationSimilarity. The cosine measure similarity and Pearson correlation aren’t the same thin
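The relationship the quoted passage points at can be checked numerically. The following is a minimal sketch in plain Java, not Mahout code: the class name SimilaritySketch and its methods are hypothetical, and it only illustrates the general fact that Pearson correlation equals the cosine of the angle between two vectors after each has had its mean subtracted.

```java
// Hypothetical illustration; none of these names come from Mahout.
public class SimilaritySketch {

    // Cosine of the angle between x and y (no centering).
    public static double cosine(double[] x, double[] y) {
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / Math.sqrt(normX * normY);
    }

    // Returns a copy of x with its mean subtracted from every element.
    public static double[] center(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / x.length;
        double[] centered = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            centered[i] = x[i] - mean;
        }
        return centered;
    }

    // Pearson correlation computed from raw sums, independently of center().
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0, sumYY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
        }
        double cov = sumXY - sumX * sumY / n;
        double varX = sumXX - sumX * sumX / n;
        double varY = sumYY - sumY * sumY / n;
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 5, 9};
        System.out.println("cosine (raw)      = " + cosine(x, y));
        System.out.println("cosine (centered) = " + cosine(center(x), center(y)));
        System.out.println("pearson           = " + pearson(x, y));
    }
}
```

On uncentered data the raw cosine and the Pearson value generally differ; the centered cosine and the Pearson value agree up to floating-point rounding.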

Re: [jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

2010-04-27 Thread Jeff Eastman
Ok, just checking. I've got an initial implementation that I'm debugging and will post a patch soon. The equations in the paper still leave a bit to the student from a completeness perspective. On 4/27/10 12:15 AM, Robin Anil (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-

Re: [jira] Commented: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-04-27 Thread Jeff Eastman
I'm not arguing it is a performance improvement for sparse vectors, just that changing the class of the vector should not be necessary: if the vectors being clustered are dense then the cluster constructors should leave them dense. If the vectors that are being clustered are of a sparse variety

Re: How to tackle Vector->NamedVector and back conversion

2010-04-26 Thread Jeff Eastman
Correct, all the clustered points are now clusterId -> VectorWritable. This reflects some loss of generality for the two fuzzy clusterers (fuzzyK, Dirichlet) and I will likely need to add another clustering option for them that includes probability of membership. But for now and for the CDbw ca

Re: announcing new TLPs [was: ASF Board Meeting Summary - April 21, 2010 - new TLP reporting schedule?]

2010-04-26 Thread Jeff Eastman
+1 On 4/26/10 5:24 AM, Grant Ingersoll wrote: My edits inline. On Apr 26, 2010, at 3:45 AM, Sean Owen wrote: Here's my suggested boilerplate -- see below and please suggest edits if desired. There's a 150 word limit. Apache Mahout provides scalable implementations of machine learning alg

Re: Mahout In Action

2010-04-23 Thread Jeff Eastman
See ClusterBase for those constants On 4/23/10 11:53 AM, Robin Anil wrote: May I suggest keeping constants in a public String value. That way people will not hard code clusters-0 and so on and instead use Clusterer.CLUSTER_DIR On Fri, Apr 23, 2010 at 11:55 PM, Jeff Eastman wrote:
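The suggestion in this message, keeping directory names in public constants instead of hard-coding them, might look like the sketch below. This is an assumption-laden illustration: the class ClusterPaths, its field names, and the helper clustersDir are hypothetical and are not the actual contents of ClusterBase. Only the string values "clusters-" and "clusteredPoints" echo names mentioned in this thread.

```java
// Hypothetical holder for output directory names; not Mahout's ClusterBase.
public final class ClusterPaths {

    // Prefix of the per-iteration clusters directories, e.g. "clusters-0".
    public static final String CLUSTERS_DIR_PREFIX = "clusters-";

    // Name of the directory holding the clustered points.
    public static final String CLUSTERED_POINTS_DIR = "clusteredPoints";

    private ClusterPaths() {
        // Constants holder; not meant to be instantiated.
    }

    // Builds the directory name for a given iteration number.
    public static String clustersDir(int iteration) {
        return CLUSTERS_DIR_PREFIX + iteration;
    }

    public static void main(String[] args) {
        System.out.println(clustersDir(0));
        System.out.println(CLUSTERED_POINTS_DIR);
    }
}
```

Callers would then refer to ClusterPaths.clustersDir(0) rather than a literal string, so a later rename touches one place only.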

Re: Mahout In Action

2010-04-23 Thread Jeff Eastman
ntion and the book will follow (shouldn't be the other way around) Robin On Fri, Apr 23, 2010 at 11:30 PM, Jeff Eastman wrote: The APIs did not change but the clustered points directory changed from "points" to "clusteredPoints" and the various clusters directori

Re: Mahout In Action

2010-04-23 Thread Jeff Eastman
ain on trunk? On 4/23/10 9:10 AM, Sean Owen wrote: Good eye, this was fixed in the manuscript a while ago. I will ping Manning to re-publish Chapters 1-6 since a lot of small updates have happened since then. On Fri, Apr 23, 2010 at 4:53 PM, Jeff Eastman wrote: Section 4.

Mahout In Action

2010-04-23 Thread Jeff Eastman
Section 4.5.1 says: "The third line shows how it is based on item-item similarities, not user-user similarities as before. The algorithms are similar, but not entirely symmetric. They do have notably different properties. For instance, the running time of an item-based recommender scales up as

[jira] Updated: (MAHOUT-236) Cluster Evaluation Tools

2010-04-21 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-236: Attachment: MAHOUT-236.patch This patch runs on top of Sean's latest patch (r936453) and a

[jira] Updated: (MAHOUT-236) Cluster Evaluation Tools

2010-04-20 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-236: Attachment: MAHOUT-236.patch I made some small changes to fuzzyK clustering and now the evaluator

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-20 Thread Jeff Eastman
+1 As Robin noted, this patch will affect some of the clustering code and it will conflict with the changes I've been working for MAHOUT-236. On balance, fixing the whole Vector equivalence mess seems prudent and I will deal with the rework. You've done a pile of work here and I think factoring

[jira] Updated: (MAHOUT-236) Cluster Evaluation Tools

2010-04-20 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-236: Attachment: MAHOUT-236.patch Here's a patch that adds a CDbw reference point MR job that ite