Re: [jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-09 Thread Pallavi Palleti
I will add my patch within 3 to 4 days. I am done with everything except that I need to write some test classes. Thanks Pallavi Robin Anil (JIRA) wrote: [

[jira] Updated: (MAHOUT-283) Update assemblies to include mahout-collections for release build

2010-02-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-283: - Resolution: Fixed Fix Version/s: (was: 0.3) 0.4 Assignee: Drew

[jira] Resolved: (MAHOUT-280) Clean some redundant POM declarations

2010-02-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-280. -- Resolution: Won't Fix Fix Version/s: (was: 0.3) 0.4 This may be

[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-09 Thread Pallavi Palleti (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-153: --- Attachment: Mahout-153.patch Here is the patch for selecting initial clusters for a

[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-09 Thread Pallavi Palleti (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831391#action_12831391 ] Pallavi Palleti commented on MAHOUT-153: Forgot to mention. The above patch doesn't

[jira] Created: (MAHOUT-284) In Fuzzy Kmeans, when the distance between centroid and the given point is zero, then it should belong to that cluster with probability 1 and rest with probability zero

2010-02-09 Thread Pallavi Palleti (JIRA)
In Fuzzy Kmeans, when the distance between centroid and the given point is zero, then it should belong to that cluster with probability 1 and rest with probability zero
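
A minimal sketch of the behaviour this issue asks for, assuming the standard fuzzy k-means membership formula u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)). The class and method names here are hypothetical; this is neither Mahout's code nor the attached patch. The zero-distance check avoids the division by zero that the plain formula would hit, and giving the full membership to the first matching centroid when several coincide with the point is an assumption of this sketch.

```java
// Hypothetical illustration only; not Mahout's FuzzyKMeans implementation.
final class FuzzyMembershipSketch {

    /**
     * Computes fuzzy memberships of one point.
     * distances[i] is the distance from the point to centroid i;
     * m (greater than 1) is the fuzziness factor.
     */
    static double[] memberships(double[] distances, double m) {
        double[] u = new double[distances.length];

        // Special case from the issue: the point sits exactly on a centroid.
        // That cluster gets probability 1 and all others stay at 0.
        // Assumption: if several centroids coincide, the first one wins.
        for (int i = 0; i < distances.length; i++) {
            if (distances[i] == 0.0) {
                u[i] = 1.0;
                return u;
            }
        }

        // General case: u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)).
        double exponent = 2.0 / (m - 1.0);
        for (int i = 0; i < distances.length; i++) {
            double sum = 0.0;
            for (int k = 0; k < distances.length; k++) {
                sum += Math.pow(distances[i] / distances[k], exponent);
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }
}
```

Calling `FuzzyMembershipSketch.memberships(new double[] {0.0, 3.0, 4.0}, 2.0)` takes the special-case branch; with all distances non-zero the memberships sum to 1.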

[jira] Updated: (MAHOUT-284) In Fuzzy Kmeans, when the distance between centroid and the given point is zero, then it should belong to that cluster with probability 1 and rest with probability zero

2010-02-09 Thread Pallavi Palleti (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-284: --- Attachment: Mahout-284.patch This patch fixes the issue: In Fuzzy Kmeans, when the distance

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-09 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831396#action_12831396 ] Jake Mannix commented on MAHOUT-237: {code} RandomAccessSparseVector vector =

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831410#action_12831410 ] Sean Owen commented on MAHOUT-237: -- I dunno, I think of it as exactly that flag, doesn't

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-09 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831413#action_12831413 ] Jake Mannix commented on MAHOUT-237: I think of it as that flag as well, but when doing

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-09 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831420#action_12831420 ] Jake Mannix commented on MAHOUT-237: I do notice that recently added to this set of

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-09 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831428#action_12831428 ] Robin Anil commented on MAHOUT-237: --- You just needed the count? You could always

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Sean
I don't, but can offer alternatives -- Just have the user download the data set. I don't think this is a big burden. Download the data set automatically. These are free of legal and tarball-size problems. On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil robin.a...@gmail.com wrote: I feel a need to

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Make the Maven test phase download this dataset once for all tests? Is that possible? On Tue, Feb 9, 2010 at 7:43 PM, Sean sro...@gmail.com wrote: I don't, but can offer alternatives -- Just have the user download the data set. I don't think this is a big burden. Download the data set

[jira] Created: (MAHOUT-285) Wrap up collocation and dictionary vectorizer integration

2010-02-09 Thread Drew Farris (JIRA)
Wrap up collocation and dictionary vectorizer integration - Key: MAHOUT-285 URL: https://issues.apache.org/jira/browse/MAHOUT-285 Project: Mahout Issue Type: Improvement Affects

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll
Sure, how about a bunch of Apache project websites? The project name is the category, i.e. Lucene, Tomcat, Hadoop, etc. On Feb 9, 2010, at 9:11 AM, Robin Anil wrote: I feel a need to check in a set of text documents to mahout. maybe 3-4 categories of documents 10 each. can be used in

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Yeah, that sounds OK. Do we have the pure content without HTML? Robin On Tue, Feb 9, 2010 at 8:24 PM, Grant Ingersoll gsing...@apache.org wrote: Sure, how about a bunch of Apache project websites? The project name is the category, i.e. Lucene, Tomcat, Hadoop, etc. On Feb 9, 2010, at 9:11

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll
On Feb 9, 2010, at 9:56 AM, Robin Anil wrote: Yeah that sounds ok. Do we have the pure content without html ? No, but I was just thinking yesterday that a really nice enhancement to the Doc. Vectorizer would be to hook in Tika, such that one could M/R binary files into Mahout vectors.

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Yeah! Tika looks great! I bet Drew's patch to create a structured document format via Avro should essentially go into Tika. Then we could really use the Tika library to the full. I should really spend time exploring Apache projects. I think we could reuse a whole lot. Robin On Tue, Feb 9,

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll
On Feb 9, 2010, at 10:24 AM, Robin Anil wrote: Yeah!. Tika looks great!. I bet Drew's patch to create a structured document format via Avro should essentially go into Tika. Then we could really use the Tika library to the full. Solr has code here that would be pretty simple to grab, but it's

[jira] Resolved: (MAHOUT-242) LLR Collocation Identifier

2010-02-09 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-242. --- Resolution: Fixed Committed and resolved. LLR Collocation Identifier --

[jira] Created: (MAHOUT-286) Need to be able to run classifiers from non-text input (such as ARFF data)

2010-02-09 Thread Ted Dunning (JIRA)
Need to be able to run classifiers from non-text input (such as ARFF data) -- Key: MAHOUT-286 URL: https://issues.apache.org/jira/browse/MAHOUT-286 Project: Mahout

[jira] Updated: (MAHOUT-286) Need to be able to run classifiers from non-text input (such as ARFF data)

2010-02-09 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-286: --- Attachment: weka.log mahout.log Here are the original attachments Martin sent.

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-09 Thread Ted Dunning
That was my first thought as well. But I think a better answer is to mark the vector as stretchy so that it reports the high water size as the actual size, but if you insert a non-zero above that size, it will report the new high water mark thereafter. This makes the code simple and clear. The
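
The "stretchy" behaviour described in this message can be sketched roughly as follows. This is an illustration under stated assumptions, not Mahout's `Vector` API: the class name, the map-backed storage, and the method names are all hypothetical. The reported size is a high-water mark that only grows when a non-zero value is set at or above it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not a Mahout class.
final class StretchyVectorSketch {
    private final Map<Integer, Double> values = new HashMap<>();
    private int highWater; // the size reported to callers

    StretchyVectorSketch(int initialSize) {
        this.highWater = initialSize;
    }

    /** Reports the current high-water mark as the size. */
    int size() {
        return highWater;
    }

    /** Unset entries, including those beyond the reported size, read as zero. */
    double get(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("negative index: " + index);
        }
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    /** Setting a non-zero value at or above the size raises the high-water mark. */
    void set(int index, double value) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("negative index: " + index);
        }
        if (value == 0.0) {
            values.remove(index); // zeros are not stored and never stretch the vector
            return;
        }
        values.put(index, value);
        if (index >= highWater) {
            highWater = index + 1;
        }
    }
}
```

In this sketch, writing a zero past the end leaves the size unchanged, which matches the "insert a non-zero above that size" wording; whether reads past the end should return zero or fail is a design choice the message does not settle.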

[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-09 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831622#action_12831622 ] Ted Dunning commented on MAHOUT-153: I have been thinking about this problem a bit,

Re: core? util?

2010-02-09 Thread Ted Dunning
We (I) have had some problems with dependencies in the past. Some code seemed very util, but some other things that seemed pretty core depended on them. I think that the real issue for me is that we have two meanings of utils. One is generally useful stuff in core and the other is things that

[jira] Commented: (MAHOUT-227) Parallel SVM

2010-02-09 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831648#action_12831648 ] Ted Dunning commented on MAHOUT-227: Is this going to be complete this week or next?

Re: core? util?

2010-02-09 Thread Jake Mannix
On Tue, Feb 9, 2010 at 12:20 PM, Ted Dunning ted.dunn...@gmail.com wrote: I think that the real issue for me is that we have two meanings of utils. One is generally useful stuff in core and the other is things that use mahout to do cool things. This is my problem too: *examples* is things

[jira] Updated: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2010-02-09 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-180: --- Attachment: MAHOUT-180.patch Ok, ugly, dirty patch which needs to be cleaned up, but it does work,

Re: core? util?

2010-02-09 Thread Grant Ingersoll
On Feb 9, 2010, at 3:31 PM, Jake Mannix wrote: On Tue, Feb 9, 2010 at 12:20 PM, Ted Dunning ted.dunn...@gmail.com wrote: I think that the real issue for me is that we have two meanings of utils. One is generally useful stuff in core and the other is things that use mahout to do cool

Re: core? util?

2010-02-09 Thread Drew Farris
Ahh, OK, this makes sense. Also, as others pointed out, within 'core' are some 'small' utilities used by core that are undeserving of their own module, e.g. HadoopUtil. These generally go under the org.apache.mahout.common package. On Tue, Feb 9, 2010 at 3:43 PM, Grant Ingersoll

[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-02-09 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831678#action_12831678 ] Jeff Eastman commented on MAHOUT-270: - r908235 commits the Printable interface and

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831680#action_12831680 ] Sean Owen commented on MAHOUT-237: -- PS I think Ted's suggestion that we need 'stretchable'

[jira] Issue Comment Edited: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-02-09 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831678#action_12831678 ] Jeff Eastman edited comment on MAHOUT-270 at 2/9/10 9:39 PM: -

[jira] Commented: (MAHOUT-279) Make RandomSeedGenerator a M/R Job

2010-02-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831754#action_12831754 ] Sean Owen commented on MAHOUT-279: -- Bah, it doesn't actually work in Hadoop, for reasons I

[jira] Commented: (MAHOUT-227) Parallel SVM

2010-02-09 Thread zhao zhendong (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831837#action_12831837 ] zhao zhendong commented on MAHOUT-227: -- So far, I didn't work on this parallel Binary

[jira] Commented: (MAHOUT-285) Wrap up collocation and dictionary vectorizer integration

2010-02-09 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831870#action_12831870 ] Robin Anil commented on MAHOUT-285: --- In the Colloc driver why not run DocumentProcessor

Some more dependencies

2010-02-09 Thread Robin Anil
There are some libraries in Mahout that are used only in very specific places, by only a few classes. Can't we do without them? All these stats are courtesy of this wonderful Eclipse plugin, STAN http://stan4j.com/dependencies/dependency-analysis.html Only 3 classes used for the EDU.oswego library.

Re: Some more dependencies

2010-02-09 Thread Jake Mannix
The lovely-named EDU.oswego.* stuff is from Doug Lea's concurrent lib. I had tried really hard to figure out how to pull it out when I first brought Colt into the fold, but it turns out that these are parts of concurrent which didn't make it into java.util.concurrent, and so actually aren't available in