Speed up Frequent Compile

2010-02-05 Thread Robin Anil
When developing mahout core/util/examples we don't need to generate math often, and we don't need to tar, gzip, and bzip2 the jar files. We are mostly concerned with the job file/jar file. Can't there be another target, like develop, that does this? (Waiting 2-3 mins for a 2-line change is frustrating.) Robin

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Attachment: MAHOUT-237-tfidf.patch 4 Main Entry points DocumentProcessor - does SequenceFile =

Re: Mahout 0.3 Plan and other changes

2010-02-05 Thread Robin Anil
I am committing the first level of changes so that Drew can work on it. I have updated the patch on the issue as a reference. Ted, please take a look when you get time. The names will change correspondingly. What I have right now is 4 Main Entry points DocumentProcessor - does SequenceFile =

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Status: Patch Available (was: Reopened) Working Implementation DictionaryVectorizer using with tf,

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Resolution: Fixed Status: Resolved (was: Patch Available) Map/Reduce Implementation of

[jira] Resolved: (MAHOUT-220) Mahout Bayes Code cleanup

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-220. --- Resolution: Fixed Committed. Mahout Bayes Code cleanup -

[jira] Resolved: (MAHOUT-221) Implementation of FP-Bonsai Pruning for fast pattern mining

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-221. --- Resolution: Fixed Committed Implementation of FP-Bonsai Pruning for fast pattern mining

[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830056#action_12830056 ] Robin Anil commented on MAHOUT-153: --- Any progress on this? Will it be ready soon or

Re: Release thinking

2010-02-05 Thread Robin Anil
Reviving this thread. Copy paste the whole thing as we move forward.
Current Snapshot
Key Summary
MAHOUT-221 Implementation of FP-Bonsai Pruning for fast pattern mining: Done
MAHOUT-227 Parallel SVM: In Progress
MAHOUT-240 Parallel version of Perceptron: Little Progress

[jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830077#action_12830077 ] Robin Anil commented on MAHOUT-185: --- I like the script as I am running k-means these days

Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Grant Ingersoll
One thought on these lines is that we should start the process to be a TLP, then we could have a subproject explicitly dedicated to C++ (or any other language) and there wouldn't necessarily need to be a 1-1 port. -Grant On Feb 5, 2010, at 12:56 AM, Kay Kay wrote: If there were an effort to

Re: Release thinking

2010-02-05 Thread Ted Dunning
I just marked the 0.1 and 0.2 releases as released (about time). This makes the JIRA road map feature more usable. See here for the live version of this summary: https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel On Fri, Feb 5, 2010 at

Re: [jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-02-05 Thread Ted Dunning
Surely there is a clever way to use annotations for this. Not that I know what it might be. On Fri, Feb 5, 2010 at 4:05 AM, Robin Anil (JIRA) j...@apache.org wrote: If we go like this we might have too many options. Any way to streamline this? One thought I have is to have package level

Re: Release thinking

2010-02-05 Thread Robin Anil
Yum Yum. 0.1: 59 issues; 0.2: 66 issues; 0.3: 91 issues, 13 left. On Fri, Feb 5, 2010 at 9:47 PM, Ted Dunning ted.dunn...@gmail.com wrote: I just marked the 0.1 and 0.2 releases as released (about time). This makes the JIRA road map feature more usable. See here for the live version of

[jira] Created: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-05 Thread Drew Farris (JIRA)
Use avro for serialization of structured documents. --- Key: MAHOUT-274 URL: https://issues.apache.org/jira/browse/MAHOUT-274 Project: Mahout Issue Type: Improvement Reporter: Drew

[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-05 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-274: --- Attachment: mahout-avro-examples.tar.gz Very rudimentary exploration of using avro to produce

Re: Release thinking

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 11:17 AM, Ted Dunning ted.dunn...@gmail.com wrote: I just marked the 0.1 and 0.2 releases as released (about time).  This makes the JIRA road map feature more usable. See here for the live version of this summary:

Re: Speed up Frequent Compile

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil robin.a...@gmail.com wrote: When developing mahout core/util/examples we don't need to generate math often, and we don't need to tar, gzip, and bzip2 the jar files. We are mostly concerned with the job file/jar file. Can't there be another target, like develop

Re: Speed up Frequent Compile

2010-02-05 Thread Ted Dunning
I usually do an initial compilation using mvn package. Then, during development I use IntelliJ's incremental compilation which generally only takes a few seconds. Since that compilation doesn't handle things like copying resources, I get caught out and surprised now and again, but this works

Re: Release thinking

2010-02-05 Thread Ted Dunning
Makes a lot of sense. Drew? On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote: So are we really planning on all this structured document stuff and Avro for 0.3? Can we just try and finish up what was already scoped for 0.3 and have a quick turnaround for getting

Re: Release thinking

2010-02-05 Thread Jake Mannix
On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote: So are we really planning on all this structured document stuff and Avro for 0.3? Can we just try and finish up what was already scoped for 0.3 and have a quick turnaround for getting things which have only been really

Re: Release thinking

2010-02-05 Thread Drew Farris
Sounds great to me. On Fri, Feb 5, 2010 at 11:50 AM, Ted Dunning ted.dunn...@gmail.com wrote: Makes a lot of sense.  Drew? On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote: So are we really planning on all this structured document stuff and Avro for 0.3?  Can we just

Re: Release thinking

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 11:53 AM, Jake Mannix jake.man...@gmail.com wrote: Which is not to say that we shouldn't continue work on them, let's keep the patches going and up to date, let's just not worry about holding up 0.3 until they're fully tested and checked in. Yes absolutely. I'm also

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
I use mvn install to generate the job; it takes around 2-3 mins because it generates the bz2, zip, and gz archives. Otherwise I use mvn compile, which takes 33 sec (15 secs of that go to compiling math). On Fri, Feb 5, 2010 at 10:18 PM, Drew Farris drew.far...@gmail.com wrote: On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil robin.a...@gmail.com wrote:

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
Yes, for editing I use Eclipse in the same fashion. If I want to try out a job and see how it performs on Hadoop, I need the job compiled fast. On another note, I think there will be a lot of dead code in the job (with all the jar files bundled). Is there an optimiser for that, i.e. to remove classes which

Re: Release thinking

2010-02-05 Thread Robin Anil
I just updated it here. http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html Let's rename/refactor the classes and get the basic Avro thing in for 0.3, so that people who use it get a smooth upgrade to 0.4. Robin On Fri, Feb 5, 2010 at 10:32 PM, Drew Farris drew.far...@gmail.com wrote: On

[jira] Updated: (MAHOUT-272) Add licenses for 3rd party jars to mahout binary release and remove additional unused dependencies.

2010-02-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-272: - Resolution: Fixed Assignee: Drew Farris Status: Resolved (was: Patch Available) Add

Re: Speed up Frequent Compile

2010-02-05 Thread Drew Farris
So, I'm running: mvn -o install -DskipTests=true at the project root (in mahout). Comment out or remove the maven-assembly-plugin definition in core/pom.xml -- it reduced my core build time from 26s to 6s -- I can submit a patch for this. Mahout math is still 17s here due to code generation. I'm

Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Israel Ekpo
Thanks everyone for your responses so far. The Apache Hadoop dependency was something I thought about initially but I still went ahead to ask the question anyways. At this time, it would be a better use of resources and time to come up with a wrapper or HTTP server/client set up of some sort.

Re: Speed up Frequent Compile

2010-02-05 Thread Benson Margulies
Yes, the codegen could drop a timestamp file. It's a fair amount of work, and if we're killing this code for HPCC I'm dubious. If I could make the split work I could do this next. On Fri, Feb 5, 2010 at 12:19 PM, Drew Farris drew.far...@gmail.com wrote: So, I'm running: mvn -o install
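
The timestamp-file idea mentioned in this message can be sketched roughly as follows. This is only an illustration in Python, not Mahout's actual code generator: the function names, the template directory, and the stamp path are all hypothetical, and a real version would presumably live in the Java/Maven build.

```python
from pathlib import Path


def needs_regeneration(template_dir: Path, stamp_file: Path) -> bool:
    """Return True if generated sources may be stale.

    Hypothetical helper: generation is needed when no stamp exists yet,
    or when any template file was modified after the stamp was written.
    """
    if not stamp_file.exists():
        return True
    stamp_mtime = stamp_file.stat().st_mtime
    return any(
        path.is_file() and path.stat().st_mtime > stamp_mtime
        for path in template_dir.rglob("*")
    )


def mark_generated(stamp_file: Path) -> None:
    """Create or refresh the stamp after a successful generation run."""
    stamp_file.parent.mkdir(parents=True, exist_ok=True)
    stamp_file.touch()
```

A build step would call needs_regeneration() first, run the generator only when it returns True, and then call mark_generated(); unchanged templates would then skip the code generation time reported earlier in the thread.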

Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Israel Ekpo
Grant, Would the TLP be Mahout or under a different name? I also like the idea that it does not necessarily have to be a 1:1 port. Kay Kay, I change my mind (going the wrapper route), I think it would be nice to explore the possibilities with just a subset of the algorithms. That would be a

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
It's just meant to be a dev-only hack :) On Sat, Feb 6, 2010 at 3:09 AM, Benson Margulies bimargul...@gmail.com wrote: Yes, the codegen could drop a timestamp file. It's a fair amount of work, and if we're killing this code for HPCC I'm dubious. If I could make the split work I could do this

Re: Speed up Frequent Compile

2010-02-05 Thread Benson Margulies
Then we could make a profile that turns off the code gen and turns on the build helper to add the generated source dir instead. On Fri, Feb 5, 2010 at 4:49 PM, Robin Anil robin.a...@gmail.com wrote: Its just meant to be a dev only hack :) On Sat, Feb 6, 2010 at 3:09 AM, Benson Margulies

Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-05 Thread Jeff Eastman
Jeff Eastman wrote: Jeff Eastman wrote: Jeff Eastman wrote: Ted Dunning wrote: This could also be caused if the prior is very diffuse. This makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha.