[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2012-02-22 Thread Frank Scholten (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213483#comment-13213483 ] Frank Scholten commented on MAHOUT-944: --- Saving to Lucene indexes is a different use

Re: New Committer: Tom Pierce

2012-02-22 Thread Grant Ingersoll
Welcome aboard! On Feb 21, 2012, at 6:11 PM, Jeff Eastman wrote: The Project Management Committee (PMC) for Apache Mahout has asked Tom Pierce to become a committer and we are pleased to announce that he has accepted. Being a committer enables easier contribution to the project since there

0.7 Priorities

2012-02-22 Thread Jake Mannix
On recent threads on the dev@ list, and discussions off-list, it's pretty clear that we need to have cleanup be a priority for the next release. How about this for a formal proposal: - The 0.7 release will have issues (both new and on JIRA) be primarily focused on bugfixes / cleanup /

Re: 0.7 Priorities

2012-02-22 Thread Ted Dunning
Aye say I. Sent from my iPhone On Feb 22, 2012, at 4:24 AM, Jake Mannix jake.man...@gmail.com wrote: If we're able to wrap this release up cleanly and get quickly moving on to new features again, maybe we can try this on a more regular basis, with even releases being feature-work, and odd

Hadoop Utils

2012-02-22 Thread Grant Ingersoll
We've collected a fair bit of Hadoop utils over the years. I am finding them generally useful in other projects. Would it make sense to either split them out to a standalone jar and/or donate them upstream to Hadoop itself? I'm thinking the things like: Seq File iterators and potentially the

Re: 0.7 Priorities

2012-02-22 Thread Shannon Quinn
I say aye. iPhone'd On Feb 22, 2012, at 9:35, Ted Dunning ted.dunn...@gmail.com wrote: Aye say I. Sent from my iPhone On Feb 22, 2012, at 4:24 AM, Jake Mannix jake.man...@gmail.com wrote: If we're able to wrap this release up cleanly and get quickly moving on to new features

Re: Hadoop Utils

2012-02-22 Thread Sean Owen
I think it's fine to let them live in integration here rather than a new module. The iterators could be useful upstream, yes, and maybe a few more bits. The AbstractJob might still be a little too app specific. On Feb 22, 2012 2:37 PM, Grant Ingersoll gsing...@apache.org wrote: We've collected a

Re: New Committer: Tom Pierce

2012-02-22 Thread Tom Pierce
Thanks everyone! -tom On Wed, Feb 22, 2012 at 6:07 AM, Grant Ingersoll gsing...@apache.org wrote: Welcome aboard! On Feb 21, 2012, at 6:11 PM, Jeff Eastman wrote: The Project Management Committee (PMC) for Apache Mahout has asked Tom Pierce to become a committer and we are pleased to

Re: 0.7 Priorities

2012-02-22 Thread Jeff Eastman
I'll go for aye aye maties, shall we aim for end of May? On 2/22/12 7:41 AM, Shannon Quinn wrote: I say aye. iPhone'd On Feb 22, 2012, at 9:35, Ted Dunning ted.dunn...@gmail.com wrote: Aye say I. Sent from my iPhone On Feb 22, 2012, at 4:24 AM, Jake Mannix jake.man...@gmail.com wrote:

Re: Helping out with the .7 release

2012-02-22 Thread Jeff Eastman
Hi Saikat, I agree with Paritosh, that a great place to begin would be to write some unit tests. This will familiarize you with the code base and help us a lot with our 0.7 housekeeping release. The new clustering classification components are going to unify many - but not all - of the

Re: Hadoop Utils

2012-02-22 Thread Ted Dunning
But is integration published as a separate jar? Sent from my iPhone On Feb 22, 2012, at 6:52 AM, Sean Owen sro...@gmail.com wrote: I think its fine to let them live in integration here rather than a new module. The iterators could be useful upstream yes and maybe a few more bits. The

Re: Hadoop Utils

2012-02-22 Thread Ted Dunning
Separate jar does mean separate maven artifact but the dependency mechanism should handle that and the new artifacts should be very stable. Sent from my iPhone On Feb 22, 2012, at 6:54 AM, Jake Mannix jake.man...@gmail.com wrote: On Wed, Feb 22, 2012 at 6:37 AM, Grant Ingersoll

Re: Helping out with the .7 release

2012-02-22 Thread Jake Mannix
So I haven't looked super-carefully at the clustering refactoring work, can someone give a little overview of what the plan is? The NewLDA stuff is technically in clustering and generally works by taking in SeqFile<IW,VW> documents as the training corpus, and spits out two things: SeqFile<IW,VW> of a

RE: Helping out with the .7 release

2012-02-22 Thread Saikat Kanjilal
Jeff, I'm pretty excited to help out with this, so as a starter can you point me to where I should begin my readings of the code? I haven't looked too closely, but are there certain classes in the clustering area where this refactoring effort is centered? Regards Date: Wed, 22 Feb 2012

Re: Helping out with the .7 release

2012-02-22 Thread Jake Mannix
On Wed, Feb 22, 2012 at 10:32 AM, Jeff Eastman j...@windwardsolutions.com wrote: This refactoring is focused on some of the iterative clustering algorithms which, in each iteration, load a prior set of clusters (e.g. clusters-0) and process each input vector against them to produce a posterior
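The iteration pattern described in this message (load a prior set of clusters, process every input vector against them, emit a posterior set) can be illustrated with a minimal in-memory sketch. This is not Mahout's ClusterIterator or any Mahout API: the class name `ClusterIterationSketch`, the method `iterate`, the use of plain `double[]` vectors, squared Euclidean distance, and the k-means-style mean update are all assumptions chosen for illustration.

```java
/** Hypothetical sketch of one prior-to-posterior clustering iteration (k-means style). */
final class ClusterIterationSketch {

    /**
     * Assigns each input vector to its nearest prior center (squared Euclidean
     * distance) and returns the posterior centers as the mean of the assigned vectors.
     * A center that receives no vectors is carried over unchanged.
     */
    static double[][] iterate(double[][] prior, double[][] vectors) {
        int k = prior.length;
        int dim = prior[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        for (double[] v : vectors) {
            // Find the closest prior cluster for this vector.
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int d = 0; d < dim; d++) {
                    double diff = v[d] - prior[c][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            // Accumulate the observation into the winning cluster.
            counts[best]++;
            for (int d = 0; d < dim; d++) {
                sums[best][d] += v[d];
            }
        }

        // Build the posterior clusters from the accumulated statistics.
        double[][] posterior = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                posterior[c] = prior[c].clone();
            } else {
                posterior[c] = new double[dim];
                for (int d = 0; d < dim; d++) {
                    posterior[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return posterior;
    }
}
```

Calling `iterate` repeatedly, feeding each posterior back in as the next prior, gives the iterative loop; in a MapReduce setting the assignment step would correspond to the mapper and the accumulation to the reducer, though that mapping is also an assumption here.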

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

2012-02-22 Thread Dmitriy Lyubimov (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-817: Attachment: MAHOUT-817-RC1.patch refreshing the attached patch (called RC1) to correspond

Re: Helping out with the .7 release

2012-02-22 Thread Jeff Eastman
On 2/22/12 11:58 AM, Jake Mannix wrote: On Wed, Feb 22, 2012 at 10:32 AM, Jeff Eastman j...@windwardsolutions.com wrote: This refactoring is focused on some of the iterative clustering algorithms which, in each iteration, load a prior set of clusters (e.g. clusters-0) and process each input

Re: Helping out with the .7 release

2012-02-22 Thread Ted Dunning
I would also like to see if we can put an all reduce implementation into this effort. The idea is that we can use a map only job that does all iteration internally. I think that this could result in more than an order of magnitude speed up for our clustering codes. It could also provide

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

2012-02-22 Thread Jeff Eastman (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214091#comment-13214091 ] Jeff Eastman commented on MAHOUT-929: - Hey Paritosh, why don't you take over this

Re: Helping out with the .7 release

2012-02-22 Thread Jeff Eastman
Hi Saikat, Glad you're excited. Paritosh offered one suggestion below. You could look at TestKmeansClustering for patterns you could use to test the ClusterClassificationMapper and Driver in MR mode. That should be straightforward, but please coordinate with Paritosh so you don't duplicate

RE: Helping out with the .7 release

2012-02-22 Thread Saikat Kanjilal
Yes, perfect. I'll look at those, begin reading there, and figure out next steps. Thanks again for your help in starting this effort. Date: Wed, 22 Feb 2012 16:25:27 -0700 From: j...@windwardsolutions.com To: dev@mahout.apache.org Subject: Re: Helping out with the .7 release Hi Saikat,

Re: Helping out with the .7 release

2012-02-22 Thread Jeff Eastman
Hey Ted, Could you elaborate on this approach? I don't grok how an all reduce implementation can be done with a map-only job, or how a mapper could do all iteration[s] internally. I've just gotten the ClusterIterator working in MR mode and it does what I thought we'd been talking about

Jenkins build is still unstable: Mahout-Quality #1361

2012-02-22 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/changes

Working on PCA tutorial. Question

2012-02-22 Thread Dmitriy Lyubimov
Hi, working on the PCA section in SSVD usage. Just to confirm: if we run an SVD over input with the mean subtracted, then the U matrix presents the original data points converted to PCA space, right? Thanks. -d
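For readers following the question, the standard textbook relation between SVD and PCA can be written out as below. The notation is assumed here and is not taken from the tutorial draft: A is the m-by-n input with one data point per row, \xi is the column vector of column means, and \mathbf{1} is the all-ones vector of length m.

```latex
\begin{aligned}
\tilde{A} &= A - \mathbf{1}\,\xi^{T}
  && \text{(subtract the mean from every row)} \\
\tilde{A} &\approx U \Sigma V^{T}
  && \text{(approximate SVD of the centered input)} \\
\tilde{A} V &\approx U \Sigma
  && \text{(rows are the data points in principal-component coordinates)} \\
U &\approx \tilde{A} V \Sigma^{-1}
  && \text{(the same coordinates, each axis rescaled by } 1/\sigma_i \text{)}
\end{aligned}
```

Under these assumptions, the rows of U are the data points expressed along the principal axes but normalized by the singular values, while U \Sigma gives the unnormalized PCA coordinates.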

[jira] [Commented] (MAHOUT-933) Implement mapreduce version of ClusterIterator

2012-02-22 Thread Hudson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214175#comment-13214175 ] Hudson commented on MAHOUT-933: --- Integrated in Mahout-Quality #1362 (See

Re: Working on PCA tutorial. Question

2012-02-22 Thread Nathan Halko
Hi Dmitriy, Just a few comments: -- the computed factors are approximate: A \approx U \Sigma V^{T} -- the projection steps seemed transposed to me but they are consistent throughout, i.e. (2) \tilde{u} = \tilde{c}_{r} V \Sigma^{-1} -- p. 3: transpose \xi to emphasize row vector - 'mean of all

Re: Helping out with the .7 release

2012-02-22 Thread Ted Dunning
All reduce is a non map reduce primitive stolen from mpi. It is used, for example, in vw to accumulate gradient information without additional Map reduce iterations. The all reduce operation works by building a tree of all tasks. A state is sent up the tree from the leaves. Each internal node
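The tree-based all-reduce described in this message can be sketched as a single-process simulation. This is not code from vw or Mahout: the class name `TreeAllReduceSketch`, the method `allReduceSum`, the implicit binary-tree layout, and the choice of element-wise sum as the combining operation are assumptions for illustration. The down pass (broadcasting the combined state back to every task) follows the usual definition of all-reduce in MPI; the message above is cut off before it reaches that step.

```java
import java.util.Arrays;

/** Single-process simulation of a tree all-reduce over vector sums (illustrative only). */
final class TreeAllReduceSketch {

    /**
     * local[i] is the partial state held by task i; all states must have the same length.
     * Tasks form an implicit binary tree: the parent of task i is (i - 1) / 2 and
     * task 0 is the root. Returns the state each task holds after the all-reduce.
     */
    static double[][] allReduceSum(double[][] local) {
        int n = local.length;
        double[][] state = new double[n][];
        for (int i = 0; i < n; i++) {
            state[i] = Arrays.copyOf(local[i], local[i].length);
        }

        // Up pass: each task adds its accumulated state into its parent.
        // Visiting higher indices first means a task has already absorbed
        // its own children before it reports to its parent.
        for (int i = n - 1; i >= 1; i--) {
            double[] parent = state[(i - 1) / 2];
            for (int d = 0; d < parent.length; d++) {
                parent[d] += state[i][d];
            }
        }

        // Down pass: the root now holds the global sum; every task copies
        // the result from its parent, so the total flows back to the leaves.
        for (int i = 1; i < n; i++) {
            double[] parent = state[(i - 1) / 2];
            state[i] = Arrays.copyOf(parent, parent.length);
        }
        return state;
    }
}
```

In a real deployment each task would be a separate process exchanging messages with its parent and children; the point of the sketch is only that every task ends up with the same combined state without launching another MapReduce iteration.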

Re: Helping out with the .7 release

2012-02-22 Thread Jeff Eastman
Got any code that does this I could look at? On 2/22/12 9:23 PM, Ted Dunning wrote: All reduce is a non map reduce primitive stolen from mpi. It is used, for example, in vw to accumulate gradient information without additional Map reduce iterations. The all reduce operation works by building

Re: Helping out with the .7 release

2012-02-22 Thread Ted Dunning
Only vw itself. Sent from my iPhone On Feb 22, 2012, at 9:01 PM, Jeff Eastman j...@windwardsolutions.com wrote: Got any code that does this I could look at? On 2/22/12 9:23 PM, Ted Dunning wrote: All reduce is a non map reduce primitive stolen from mpi. It is used, for example, in vw to

Jenkins build is unstable: Mahout-Quality #1363

2012-02-22 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/1363/changes

[jira] [Commented] (MAHOUT-933) Implement mapreduce version of ClusterIterator

2012-02-22 Thread Hudson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214318#comment-13214318 ] Hudson commented on MAHOUT-933: --- Integrated in Mahout-Quality #1363 (See

[jira] [Assigned] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

2012-02-22 Thread Paritosh Ranjan (Assigned) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paritosh Ranjan reassigned MAHOUT-929: -- Assignee: Paritosh Ranjan (was: Jeff Eastman) Refactor Clustering (Vector

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

2012-02-22 Thread Paritosh Ranjan (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214329#comment-13214329 ] Paritosh Ranjan commented on MAHOUT-929: Assigned to myself. I think cluster

[jira] [Issue Comment Edited] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

2012-02-22 Thread Paritosh Ranjan (Issue Comment Edited) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214329#comment-13214329 ] Paritosh Ranjan edited comment on MAHOUT-929 at 2/23/12 6:33 AM:

[jira] [Created] (MAHOUT-981) Refactor KMeans Clustering into a separate post process with outlier pruning

2012-02-22 Thread Paritosh Ranjan (Created) (JIRA)
Refactor KMeans Clustering into a separate post process with outlier pruning Key: MAHOUT-981 URL: https://issues.apache.org/jira/browse/MAHOUT-981 Project: Mahout

[jira] [Created] (MAHOUT-982) Refactor Canopy Clustering into a separate post process with outlier pruning

2012-02-22 Thread Paritosh Ranjan (Created) (JIRA)
Refactor Canopy Clustering into a separate post process with outlier pruning Key: MAHOUT-982 URL: https://issues.apache.org/jira/browse/MAHOUT-982 Project: Mahout