[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

2010-04-25 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860690#action_12860690 ] Jake Mannix commented on MAHOUT-369: Danny, thanks for looking into this so carefully

[jira] Assigned: (MAHOUT-369) Issues with DistributedLanczosSolver output

2010-04-25 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix reassigned MAHOUT-369: -- Assignee: Jake Mannix Issues with DistributedLanczosSolver output

Re: AbstractVector.minus(Vector)

2010-04-19 Thread Jake Mannix
On Mon, Apr 19, 2010 at 9:13 AM, Sean Owen sro...@gmail.com wrote: More on Vector, as I'm browsing through it: AbstractVector.minus(Vector) says: //snip The stanza after the instanceof checks can just become the body of an overriding method in these two subclasses right? Yep, sure.

Re: AbstractVector.minus(Vector)

2010-04-19 Thread Jake Mannix
, Sean Owen sro...@gmail.com wrote: On Mon, Apr 19, 2010 at 5:33 PM, Jake Mannix jake.man...@gmail.com wrote: result.times(-1.0) with result.assign(Functions.negate) Cool, good one. The efficiency points are twofold: number of nonzero elements, and the impl: you don't want to iterate

[jira] Commented: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-19 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858659#action_12858659 ] Jake Mannix commented on MAHOUT-364: Moving this discussion over to MAHOUT-383

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
Which one is this? Wrapping Vector impls into a NamedVector/LabeledVector, or seeing if we even need the label *inside* of the Vector itself, and instead just having those live in the key part of the key-value pair in hadoop, like DistributedRowMatrix has it? -jake On Sun, Apr 18, 2010 at

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
:41 PM, Jake Mannix jake.man...@gmail.com wrote: Which one is this? Wrapping Vector impls into a NamedVector/LabeledVector, or seeing if we even need the label *inside* of the Vector itself, and instead just having those live in the key part of the key-value pair in hadoop, like

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
. Am I convincing? On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix jake.man...@gmail.com wrote: Ok this is a good con...

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
What would be the Writable hierarchy with this NamedVector proposal? On Apr 18, 2010 11:05 AM, Sean Owen sro...@gmail.com wrote: On keeping 'name': sure, I ... On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix jake.man...@gmail.com wrote: Ok this is a good con...

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
this a decorator pattern rather than subclass. On Sun, Apr 18, 2010 at 7:26 PM, Jake Mannix jake.man...@gmail.com wrote: What would be the Wri...

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
putting the name into the vector and accepting whatever strange semantics that result (missing == instead of null, for instance) more attractive as a temporary measure. On Sun, Apr 18, 2010 at 11:44 AM, Jake Mannix jake.man...@gmail.com wrote: It's not just that it is complicated

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Jake Mannix
On Sat, Apr 17, 2010 at 2:14 PM, Robin Anil robin.a...@gmail.com wrote: For this bug, let's put the id back in and remove it from the comparator/equals. Let's focus on getting the document structure correct. You mean put the 'name' back in? Since Sean has done the initial work of possibly

Re: mahout/solr integration

2010-04-16 Thread Jake Mannix
clustering algorithm so that the number of clusters does not need to be specified in advance. Has anyone done anything like this in Mahout yet? Also, I'd be happy to contribute the code to Mahout if anyone is interested. Thanks, Anthony On Fri, Apr 16, 2010 at 9:50 AM, Jake Mannix jman

Re: mahout/solr integration

2010-04-16 Thread Jake Mannix
So here's my take: once we're a TLP (next month sometime?), it is a good time to start allowing subprojects or submodules which are scripting layers on top of Mahout - whether they are PigLatin, or Cascalog, JRuby, or Clojure. If it's JVM-based, especially, having code/scripts which are drivers

Re: mahout/solr integration

2010-04-16 Thread Jake Mannix
On Fri, Apr 16, 2010 at 11:31 AM, Robin Anil robin.a...@gmail.com wrote: Hmm... this was a bit scattered of a response, but I'm really loath to turn away a) nice hooks between Solr and Mahout, b) scripting-style wrappers which could expand our community, and c) simply new

Re: mahout/solr integration

2010-04-16 Thread Jake Mannix
On Fri, Apr 16, 2010 at 11:26 AM, Grant Ingersoll gsing...@apache.org wrote: On Apr 16, 2010, at 2:21 PM, Jake Mannix wrote: So here's my take: once we're a TLP (next month sometime?), it is a good time to start allowing subprojects or submodules which are Submodules, yes, subprojects

Re: mahout/solr integration

2010-04-16 Thread Jake Mannix
On Fri, Apr 16, 2010 at 11:56 AM, Sean Owen sro...@gmail.com wrote: On Fri, Apr 16, 2010 at 7:39 PM, Jake Mannix jake.man...@gmail.com wrote: I will start playing around with Anthony's github-based stuff, and see where a patch can be made. The question is where it would go? It's a fully

Re: Having some trouble with SequentialAccessSparseVector.DenseVector

2010-04-15 Thread Jake Mannix
Hey Sean, On Thu, Apr 15, 2010 at 7:16 AM, Sean Owen sro...@gmail.com wrote: Along the way to a patch for MAHOUT-379, I'm having some trouble figuring out SequentialAccessSparseVector.DenseVector. I think it can be simplified, but unless I'm misunderstanding there are several bugs here. I'd

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Jake Mannix
Ok, back on list with this then (Thanks Danny for reminding us to deal with this perennial issue we have!) On Wed, Apr 14, 2010 at 2:26 AM, Sean Owen (JIRA) j...@apache.org wrote: Yeah let's take some time to get this right. At the moment I see four notions of equivalence in Vector (which is

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Jake Mannix
+1 -jake On Apr 14, 2010 3:20 PM, Jeff Eastman j...@windwardsolutions.com wrote: Ted Dunning wrote: On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen sro...@gmail.com wrote: ... +1 from the creator thereof, even. Especially since they never got used.

[jira] Commented: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-13 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856711#action_12856711 ] Jake Mannix commented on MAHOUT-364: Zoran, Any form of BSD-style license _is_

Re: Status of Mahout TLP

2010-04-12 Thread Jake Mannix
From what Grant said last time we talked about this, we need to wait until the next Apache directors meeting (or whatever it's called) before we move forward with that, I thought. -jake On Mon, Apr 12, 2010 at 2:43 PM, Robin Anil robin.a...@gmail.com wrote: Hi everyone, I am

[jira] Commented: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-11 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855742#action_12855742 ] Jake Mannix commented on MAHOUT-364: Hi Zoran, Neuroph looks very interesting

Re: [jira] Updated: (MAHOUT-376) Implement Map-reduce version of stochastic SVD

2010-04-11 Thread Jake Mannix
I haven't had a chance to read your attached pdf, but I *have* had a chance to code up an impl of this jira. Patch coming soon. On Apr 11, 2010 6:50 AM, Ted Dunning (JIRA) j...@apache.org wrote: [

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854950#action_12854950 ] Jake Mannix commented on MAHOUT-363: If possible, Shannon, if you could simply add

[jira] Commented: (MAHOUT-369) Issues with DistributedLanczosSolver output

2010-04-08 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855014#action_12855014 ] Jake Mannix commented on MAHOUT-369: Hold on that Sean, I made the loop like

Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Jake Mannix
I agree in principle, but having a whole different set of versionings seems kinda... messy? If m-collections goes 1.0, and then 1.1, and then m-math goes 1.0, and core goes to 0.5, we have a whole pile of different version numbers to keep track of. Didn't Lucene and Solr just intentionally do

[jira] Commented: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-04-06 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854304#action_12854304 ] Jake Mannix commented on MAHOUT-364: I've got to say, this is a fantastically well

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

2010-04-05 Thread Jake Mannix
Hi Richard, A few notes about what would be required to get a nice distributed SVD recommender in Mahout: if you look at the current distributed recommenders (in org.apache.mahout.cf.taste.hadoop package and children), you can see how it works: using HDFS-backed data, a batch of

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853308#action_12853308 ] Jake Mannix commented on MAHOUT-363: ... and actually, there is no need for Hama

Re: svn commit: r930796 - in /lucene/mahout/trunk/math: ./ src/main/java/org/apache/mahout/math/ src/main/java/org/apache/mahout/math/decomposer/hebbian/ src/main/java/org/apache/mahout/math/decompo

2010-04-04 Thread Jake Mannix
Umm, I actually depend pretty heavily on the logging in the SVD solvers. They are very long-running processes, and give off a ton of useful information about what the heck is going on. Reducing dependencies is great, but logging? I think the math stuff could really use logging. I haven't been

Re: svn commit: r930796 - in /lucene/mahout/trunk/math: ./ src/main/java/org/apache/mahout/math/ src/main/java/org/apache/mahout/math/decomposer/hebbian/ src/main/java/org/apache/mahout/math/decompo

2010-04-04 Thread Jake Mannix
thought but for me collections and Math are just tools to aid complex algorithms in Mahout core. Maybe we can move it under core and adding the required logging. Robin On Mon, Apr 5, 2010 at 11:03 AM, Jake Mannix jake.man...@gmail.com wrote: Umm, I actually depend pretty heavily

Re: svn commit: r930796 - in /lucene/mahout/trunk/math: ./ src/main/java/org/apache/mahout/math/ src/main/java/org/apache/mahout/math/decomposer/hebbian/ src/main/java/org/apache/mahout/math/decompo

2010-04-04 Thread Jake Mannix
thanks. On Sun, Apr 4, 2010 at 10:40 PM, Sean Owen sro...@gmail.com wrote: Oh OK I'll revert the change then, didn't know you wanted that. Some of the other statements could probably go but not worth digging through it. On Mon, Apr 5, 2010 at 6:33 AM, Jake Mannix jake.man...@gmail.com wrote

[jira] Commented: (MAHOUT-350) add one JobName and reduceNumber parameter to org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

2010-04-01 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852477#action_12852477 ] Jake Mannix commented on MAHOUT-350: bq. I suppose I hadn't wanted to be presumptuous

Re: Javadocs?

2010-03-30 Thread Jake Mannix
Awesome, thanks guys. Doesn't Maven do this kind of thing for us, if we tell it to? (ie can't we also have daily updates of the 0.4-SNAPSHOT javadocs automagically posted up there too?) -jake On Tue, Mar 30, 2010 at 6:28 AM, Sean Owen sro...@gmail.com wrote: Done, they're all up under

[jira] Commented: (MAHOUT-350) add one JobName and reduceNumber parameter to org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

2010-03-29 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851015#action_12851015 ] Jake Mannix commented on MAHOUT-350: Don't the jobs which implement Tool allow

Javadocs?

2010-03-29 Thread Jake Mannix
Hey gang, Where are the 0.3 javadocs on the web? All I can find right now are the 0.1's http://lucene.apache.org/mahout/javadoc/core/index.html. -jake

Re: [VOTE] Mahout as TLP

2010-03-19 Thread Jake Mannix
+1 -jake

Re: Significance of name in AbstractVector

2010-03-18 Thread Jake Mannix
Hi Pallavi, I personally agree that keeping the name as part of the mathematical vector is wrong, because it leads to not only the issues you've brought up, but also means we still have these *4* different ways of saying that two vectors are the same: ==, equals(), equivalent(), and

[jira] Commented: (MAHOUT-337) Don't serialize cached length squared in JSON vector representation

2010-03-15 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845434#action_12845434 ] Jake Mannix commented on MAHOUT-337: So a question about this: do we really want to do

Re: [NOMINATION] Sean Owen as Mahout PMC Chair

2010-03-15 Thread Jake Mannix
+1 from over here. On Mon, Mar 15, 2010 at 11:36 AM, Drew Farris drew.far...@gmail.com wrote: +1 as well. On Mon, Mar 15, 2010 at 2:34 PM, Ted Dunning ted.dunn...@gmail.com wrote: Dang. I can only second second it now. On Mon, Mar 15, 2010 at 11:28 AM, Robin Anil robin.a...@gmail.com

[jira] Commented: (MAHOUT-337) Don't serialize cached length squared in JSON vector representation

2010-03-15 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845495#action_12845495 ] Jake Mannix commented on MAHOUT-337: [quote] Yes it's possible to fix this by forcing

[jira] Updated: (MAHOUT-322) DistributedRowMatrix should live in SequenceFileWritable,VectorWritable instead of SequenceFileIntWritable,VectorWritable

2010-03-07 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-322: --- Fix Version/s: (was: 0.3) pulling this out of the track for 0.3 DistributedRowMatrix should

[jira] Resolved: (MAHOUT-315) VectorDumper should also do printing to simple {index : value, index : value, ... } output, if no dictionary is specified.

2010-03-05 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix resolved MAHOUT-315. Resolution: Fixed Fix Version/s: (was: 0.4) 0.3 Committed

Re: Who owns mahout bucket on s3?

2010-03-05 Thread Jake Mannix
On Thu, Mar 4, 2010 at 7:41 AM, Robin Anil robin.a...@gmail.com wrote: Based on what I have in mind, the usage will just be mahout vectorize -i s3://input -o s3://output -tmp hdfs://file (here, there is a risk of fixing an exact path and not knowing the hadoop user, I would have preferred a

Re: svd algorithms

2010-03-05 Thread Jake Mannix
Hi Mike, Welcome to the long journey down the road of dimensional reduction. :) On Fri, Mar 5, 2010 at 5:05 PM, mike bowles m...@mbowles.com wrote: Really large matrices require using one of the randomizing methods to get done. Require is a strong term. Really really large (but still

[jira] Resolved: (MAHOUT-310) LanczosSolver and DistributedLanczosSolver always assume rectangular input, but should also handle symmetric eigensystems.

2010-03-05 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix resolved MAHOUT-310. Resolution: Fixed Fix Version/s: 0.3 committed LanczosSolver and DistributedLanczosSolver

[jira] Resolved: (MAHOUT-313) DistributedRowMatrix needs times(Vector) implementation as M/R job

2010-03-05 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix resolved MAHOUT-313. Resolution: Fixed Fix Version/s: 0.3 Committed, code piggybacks on timesSquared

[jira] Resolved: (MAHOUT-314) DistributedRowMatrix needs a sparse DistributedRowMatrix times(DistributedRowMatrix other) implementation

2010-03-05 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix resolved MAHOUT-314. Resolution: Fixed Fix Version/s: 0.3 Committed. Current implementation is a map-side

[jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFileWritable,VectorWritable instead of SequenceFileIntWritable,VectorWritable

2010-03-05 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842213#action_12842213 ] Jake Mannix commented on MAHOUT-322: It should actually be noted that Danny's original

[jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFileWritable,VectorWritable instead of SequenceFileIntWritable,VectorWritable

2010-03-04 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841382#action_12841382 ] Jake Mannix commented on MAHOUT-322: Meaning what, Robin? We can certainly come up

Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFileWritable,VectorWritable instead of SequenceFileIntWritable,VectorWritable

2010-03-04 Thread Jake Mannix
On Thu, Mar 4, 2010 at 8:54 AM, Ted Dunning ted.dunn...@gmail.com wrote: I haven't examined the out-of-core scenarios at all, but in-memory, it is possible to have labels with no performance cost if you add the constraint that labeled matrices are only conformable if they share the

Re: Assign hack slowdown

2010-03-02 Thread Jake Mannix
Adding a skipZero() method to all the functions is probably better here, because that will be faster than an instanceof check, and easier to document than other interfaces. On Tue, Mar 2, 2010 at 1:22 AM, Sean Owen sro...@gmail.com wrote: How about merely a flag/method on BinaryFunction /
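A minimal sketch of the skipZero() idea from the message above, for illustration only. The names UnaryFn, SkipZeroSketch, NEGATE and PLUS_ONE are hypothetical and are not Mahout's actual API; the sparse vector is modeled as a plain map from index to non-zero value. The assumption is that a function reports whether it maps 0 to 0, so assign() can visit only the stored non-zero entries without an instanceof check.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical function interface (not Mahout's actual API). */
interface UnaryFn {
  double apply(double x);

  /** True when apply(0.0) == 0.0, so zero entries can safely be skipped. */
  boolean skipZero();
}

final class SkipZeroSketch {
  /** Negation keeps zeros at zero, so only non-zero entries need a visit. */
  static final UnaryFn NEGATE = new UnaryFn() {
    public double apply(double x) { return -x; }
    public boolean skipZero() { return true; }
  };

  /** Adding one turns zeros into ones, so every index must be visited. */
  static final UnaryFn PLUS_ONE = new UnaryFn() {
    public double apply(double x) { return x + 1.0; }
    public boolean skipZero() { return false; }
  };

  /**
   * Applies f in place to a sparse vector stored as index -> non-zero value.
   * Returns the number of apply() calls, to make the saving visible.
   */
  static int assign(Map<Integer, Double> sparse, int size, UnaryFn f) {
    int calls = 0;
    if (f.skipZero()) {
      // Fast path: touch only the stored (non-zero) entries.
      for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
        e.setValue(f.apply(e.getValue()));
        calls++;
      }
    } else {
      // Slow path: every index, including implicit zeros.
      for (int i = 0; i < size; i++) {
        Double old = sparse.get(i);
        double v = f.apply(old == null ? 0.0 : old);
        calls++;
        if (v == 0.0) {
          sparse.remove(i);
        } else {
          sparse.put(i, v);
        }
      }
    }
    return calls;
  }

  public static void main(String[] args) {
    Map<Integer, Double> v = new TreeMap<>();
    v.put(2, 3.0);
    v.put(7, -1.5);
    System.out.println("negate calls: " + assign(v, 10, NEGATE));
    System.out.println("plusOne calls: " + assign(v, 10, PLUS_ONE));
  }
}
```

The flag is a plain virtual call on the function object, which is the property the message argues for; the fast path does not prune entries that happen to become zero, which a real implementation would probably want to handle.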

Re: Assign hack slowdown

2010-03-02 Thread Jake Mannix
On Tue, Mar 2, 2010 at 5:21 AM, Sean Owen sro...@gmail.com wrote: I'll have a look there. May be worth piling in one more little thing like this in the 'code freeze'. Incidentally Hadoop announced version 0.20.2 a few days ago -- still looking for it on Maven but I will be starting up our

[jira] Resolved: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-03-02 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix resolved MAHOUT-301. Resolution: Fixed Checked in a version of this which works, not sure if it had the most updated

[jira] Updated: (MAHOUT-311) Update assemblies to include components of launcher script from MAHOUT-301

2010-03-02 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-311: --- Resolution: Fixed Status: Resolved (was: Patch Available) committed Update assemblies

The new improved command-line: MahoutDriver (get it?)

2010-03-02 Thread Jake Mannix
Hey all, Just an update on the new-and-improved command-line UI we have now. After a ton of iterations back and forth with Drew (thanks!), MAHOUT-301 has been committed, and brings with it the easy ability to trim down your long long command lines for most of our *Driver main() methods, by

[jira] Updated: (MAHOUT-310) LanczosSolver and DistributedLanczosSolver always assume rectangular input, but should also handle symmetric eigensystems.

2010-03-01 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-310: --- Attachment: MAHOUT-lots.diff I hope we get this release out soon, I've got a giant pile of code

Re: [jira] Created: (MAHOUT-315) VectorDumper should also do printing to simple {index : value, index : value, ... } output, if no dictionary is specified.

2010-03-01 Thread Jake Mannix
. It does this I think, though it has not yet been wired into ClusterDumper.printClusters. I wanted to give the ClusterDumper users a chance to critique my formatting but it is like the below. Jeff Jake Mannix (JIRA) wrote: VectorDumper should also do printing to simple {index

Re: Who owns mahout bucket on s3?

2010-02-28 Thread Jake Mannix
I thought you were doing the secondary sort idea? That's certainly the way to make sure you need nothing significant kept in memory, and this clearly won't scale without that optimization... I'd say this should get fixed before we release 0.3 -jake On Sun, Feb 28, 2010 at 7:30 AM, Drew

[jira] Created: (MAHOUT-312) DistributedRowMatrix iterateAll() and iterate() don't work on multi-part SequenceFiles

2010-02-28 Thread Jake Mannix (JIRA)
Project: Mahout Issue Type: Bug Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix DistributedRowMatrixIterator does not properly handle file glob paths of the various part-0 files. -- This message is automatically generated by JIRA. - You can

[jira] Created: (MAHOUT-313) DistributedRowMatrix needs times(Vector) implementation as M/R job

2010-02-28 Thread Jake Mannix (JIRA)
Feature Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix pretty self-explanatory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.

[jira] Created: (MAHOUT-314) DistributedRowMatrix needs a sparse DistributedRowMatrix times(DistributedRowMatrix other) implementation

2010-02-28 Thread Jake Mannix (JIRA)
/jira/browse/MAHOUT-314 Project: Mahout Issue Type: New Feature Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix If the matrix which is being multiplied by has been transformed into a column-sparse matrix backed

[jira] Created: (MAHOUT-315) VectorDumper should also do printing to simple {index : value, index : value, ... } output, if no dictionary is specified.

2010-02-28 Thread Jake Mannix (JIRA)
URL: https://issues.apache.org/jira/browse/MAHOUT-315 Project: Mahout Issue Type: Improvement Affects Versions: 0.2 Reporter: Jake Mannix Assignee: Jake Mannix Fix For: 0.4 I've got a patch for this, tied up in other code

[jira] Created: (MAHOUT-316) CardinalityException and IndexException should remove the default constructor, and always construct with arguments saying what the error was

2010-02-28 Thread Jake Mannix (JIRA)
was Key: MAHOUT-316 URL: https://issues.apache.org/jira/browse/MAHOUT-316 Project: Mahout Issue Type: Improvement Components: Math Affects Versions: 0.2 Reporter: Jake Mannix Fix For: 0.4 CardinalityException already has

Re: Who owns mahout bucket on s3?

2010-02-28 Thread Jake Mannix
What's the final size of the vectorized output? -jake On Feb 28, 2010 6:47 PM, Robin Anil robin.a...@gmail.com wrote: Finally some good news tried with cloudera 4 node c1.medium on 6 GB compressed(26GB uncompressed wikipedia) org.apache.mahout.text.SparseVectorsFromSequenceFiles -i wikipedia/

Re: Who owns mahout bucket on s3?

2010-02-27 Thread Jake Mannix
Hey Robin, Couple questions: what are the contents of this sequence file? Is this the output of the SparseVectorsFromSequenceFiles? Do you know the number of key-value pairs, and the cardinality of the rows? Or is this just the Text,Text raw contents sequence files? Also - how do we get

Re: Who owns mahout bucket on s3?

2010-02-27 Thread Jake Mannix
Hey Robin, that http url gives me a permission denied response... I'm not too S3 savvy, not sure if I'm checking on it right... On Sat, Feb 27, 2010 at 12:40 PM, Robin Anil robin.a...@gmail.com wrote: It's uploaded here and it's public. I will monitor usage and see if my credits don't get run

Re: Who owns mahout bucket on s3?

2010-02-27 Thread Jake Mannix
the url you tried On Sun, Feb 28, 2010 at 2:59 AM, Jake Mannix jake.man...@gmail.com wrote: Hey Robin, that http url gives me a permission denied response... I'm not too S3 savvy, not sure if I'm checking on it right... On Sat, Feb 27, 2010 at 12:40 PM, Robin Anil robin.a...@gmail.com wrote

Re: Who owns mahout bucket on s3?

2010-02-27 Thread Jake Mannix
On Sun, Feb 28, 2010 at 3:04 AM, Jake Mannix jake.man...@gmail.com wrote: Er, the one you posted! http://mahout-wikipedia.s3.amazonaws.com/wikipedia-jan-2010-seqfile-deflate-chunk-[0-5] http://mahout-wikipedia.s3.amazonaws.com/wikipedia-jan-2010-seqfile-deflate-chunk-[0-5

Re: Who owns mahout bucket on s3?

2010-02-27 Thread Jake Mannix
15GB of tokenized documents, not bad, not bad. We're not going to get a multi-billion entry matrix out of this though, are we? -jake On Sat, Feb 27, 2010 at 2:06 PM, Robin Anil robin.a...@gmail.com wrote: Update: in 20 mins the tokenization stage is complete. But it's not evident in the

Re: Who owns mahout bucket on s3?

2010-02-27 Thread Jake Mannix
said only 5 mil articles. Maybe you can generate a co-occurrence matrix :) every ngram to every other ngram :) Sounds fun? It will be HUGE! On Sun, Feb 28, 2010 at 3:43 AM, Jake Mannix jake.man...@gmail.com wrote: 15GB of tokenized documents, not bad, not bad. We're not going to get a multi

Re: Who owns mahout bucket on s3?

2010-02-27 Thread Jake Mannix
. So bye Robin On Sun, Feb 28, 2010 at 3:57 AM, Robin Anil robin.a...@gmail.com wrote: like i said only 5 mil articles. Maybe you can generate a co-occurrence matrix :) every ngram to every other ngram :) Sounds fun? It will be HUGE! On Sun, Feb 28, 2010 at 3:43 AM, Jake Mannix

[jira] Created: (MAHOUT-310) LanczosSolver and DistributedLanczosSolver always assume rectangular input, but should also handle symmetric eigensystems.

2010-02-25 Thread Jake Mannix (JIRA)
URL: https://issues.apache.org/jira/browse/MAHOUT-310 Project: Mahout Issue Type: Improvement Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix LanczosSolver calls inputMatrix.timesSquared(Vector) as its Krylov iteration
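A rough sketch of the distinction this issue is about, on plain double arrays rather than Mahout's matrix classes. It assumes that timesSquared(v) computes A^T (A v), which is what a rectangular input needs, whereas a symmetric input only needs A v per Krylov step. The class KrylovStepSketch and its method names are hypothetical and only mirror the method names mentioned in the issue.

```java
final class KrylovStepSketch {
  /** Computes A v for a dense row-major matrix. */
  static double[] times(double[][] a, double[] v) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < v.length; j++) {
        out[i] += a[i][j] * v[j];
      }
    }
    return out;
  }

  /** Computes A^T w without materializing the transpose. */
  static double[] transposeTimes(double[][] a, double[] w) {
    double[] out = new double[a[0].length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < out.length; j++) {
        out[j] += a[i][j] * w[i];
      }
    }
    return out;
  }

  /** Computes A^T (A v): the step used when A is rectangular. */
  static double[] timesSquared(double[][] a, double[] v) {
    return transposeTimes(a, times(a, v));
  }

  /** One Krylov step: A v if A is symmetric, A^T A v otherwise. */
  static double[] krylovStep(double[][] a, double[] v, boolean symmetric) {
    return symmetric ? times(a, v) : timesSquared(a, v);
  }

  public static void main(String[] args) {
    double[][] sym = {{2, 1}, {1, 3}};
    double[] v = {1, 0};
    System.out.println(java.util.Arrays.toString(krylovStep(sym, v, true)));
    System.out.println(java.util.Arrays.toString(krylovStep(sym, v, false)));
  }
}
```

Under these assumptions, always using the squared step on a symmetric matrix would iterate on A^2 instead of A, so the solver would see squared eigenvalues and do twice the multiplication work per step.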

[jira] Updated: (MAHOUT-310) LanczosSolver and DistributedLanczosSolver always assume rectangular input, but should also handle symmetric eigensystems.

2010-02-25 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-310: --- Attachment: MAHOUT-310.patch Patch has newly modified unit tests to test the symmetric case

Re: anybody want to set a record with Mahout?

2010-02-25 Thread Jake Mannix
Hmm... code: *check* desire to add stochastic decomp to code: *check* amazon credits: *check* (my account today: almost $300 left burning hole in pocket) relatively gigantic social graph: *check* legal ability to put gigantic social graph on ec2: not so check, but maybe some clever anonymization

Re: anybody want to set a record with Mahout?

2010-02-25 Thread Jake Mannix
On Thu, Feb 25, 2010 at 12:38 PM, Robin Anil robin.a...@gmail.com wrote: What's the largest dataset available? BixoLabs? Wikipedia (5 mil articles)... I don't know anything public that is that big. 5 million articles, if you take all the 1,2,3,4, and 5-grams data out of it, you could easily hit

Re: anybody want to set a record with Mahout?

2010-02-25 Thread Jake Mannix
On Thu, Feb 25, 2010 at 12:42 PM, Grant Ingersoll gsing...@apache.org wrote: I'd be a little wary of that and I'd hate to see anything happen to it (AOL comes to mind). That being said, if you just export the vectors w/o the key, it really is pretty anonymous. What other sources can we

Re: anybody want to set a record with Mahout?

2010-02-25 Thread Jake Mannix
On Thu, Feb 25, 2010 at 12:49 PM, Robin Anil robin.a...@gmail.com wrote: unigrams 3 = 384 MB dictionary... with all ngrams(pruned by llr 1) we might hit some 5-10GB of entries. With some 25 char average for 5 grams it might be safe to say that we might hit 100 million rows easily? Wait

Re: anybody want to set a record with Mahout?

2010-02-25 Thread Jake Mannix
: Stochastic decomposition doesn't care about this, I don't think. On Thu, Feb 25, 2010 at 1:43 PM, Jake Mannix jake.man...@gmail.com wrote: Of course, at this point we've got too many terms to properly do the decomposition directly on the input matrix, -- Ted Dunning, CTO DeepDyve

Re: anybody want to set a record with Mahout?

2010-02-25 Thread Jake Mannix
in yet, but it's a pretty critically useful enhancement to DistributedSparseRowMatrix which we need anyways. -jake -jake On Thu, Feb 25, 2010 at 1:43 PM, Jake Mannix jake.man...@gmail.com wrote: Of course, at this point we've got too many terms to properly do the decomposition

Re: anybody want to set a record with Mahout?

2010-02-25 Thread Jake Mannix
On Thu, Feb 25, 2010 at 2:09 PM, Jake Mannix jake.man...@gmail.com wrote: On Thu, Feb 25, 2010 at 1:48 PM, Ted Dunning ted.dunn...@gmail.com wrote: After we delete hapax, we may have considerably fewer tokens. But the LLR step that Robin implied may have already dealt with that. The more I

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-25 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838725#action_12838725 ] Jake Mannix commented on MAHOUT-301: Drew, do you have a patch with your last changes

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-24 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837917#action_12837917 ] Jake Mannix commented on MAHOUT-301: Awesome Drew, I'll check it out. {quote} One

[jira] Created: (MAHOUT-308) Improve Lanczos to handle extremely large feature sets (without hashing)

2010-02-24 Thread Jake Mannix (JIRA)
Type: Improvement Components: Math Affects Versions: 0.3 Environment: all Reporter: Jake Mannix Assignee: Jake Mannix Fix For: 0.4 DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the driver (client) computer while

[jira] Commented: (MAHOUT-308) Improve Lanczos to handle extremely large feature sets (without hashing)

2010-02-24 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838069#action_12838069 ] Jake Mannix commented on MAHOUT-308: Of course, making sure that individual mappers

[jira] Created: (MAHOUT-309) Implement Stochastic Decomposition

2010-02-24 Thread Jake Mannix (JIRA)
Reporter: Jake Mannix Assignee: Jake Mannix Fix For: 0.4 Techniques reviewed in Halko, Martinsson, and Tropp (http://arxiv.org/abs/0909.4061). The basic idea of the implementation is as follows: if the input matrix is represented
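
The MAHOUT-309 description above is cut off, so the implementation it outlines is not recoverable here. As a rough illustration only of the general randomized range-finding idea from the cited review (not of the Mahout design), the following plain-Java sketch multiplies the input matrix by a random Gaussian matrix, orthonormalizes the result, and checks how well the projection reproduces the input. The class name, method names, and the small example matrix are all hypothetical.

```java
import java.util.Random;

// Illustrative sketch only: names and the example matrix are assumptions,
// not part of Mahout or of the MAHOUT-309 proposal.
public class RangeFinderSketch {

    /** Plain dense product of an m x n matrix with an n x k matrix. */
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, n = b.length, k = b[0].length;
        double[][] c = new double[m][k];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < k; j++)
                for (int l = 0; l < n; l++)
                    c[i][j] += a[i][l] * b[l][j];
        return c;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    /** Modified Gram-Schmidt on the columns of y, in place; near-zero columns are zeroed. */
    static void orthonormalizeColumns(double[][] y) {
        int m = y.length, k = y[0].length;
        for (int j = 0; j < k; j++) {
            for (int p = 0; p < j; p++) {
                double dot = 0.0;
                for (int i = 0; i < m; i++) dot += y[i][p] * y[i][j];
                for (int i = 0; i < m; i++) y[i][j] -= dot * y[i][p];
            }
            double norm = 0.0;
            for (int i = 0; i < m; i++) norm += y[i][j] * y[i][j];
            norm = Math.sqrt(norm);
            for (int i = 0; i < m; i++) y[i][j] = norm > 1e-12 ? y[i][j] / norm : 0.0;
        }
    }

    /** Returns Q with orthonormal columns spanning A * Omega, Omega being n x k Gaussian. */
    static double[][] rangeBasis(double[][] a, int k, long seed) {
        Random rnd = new Random(seed);
        double[][] omega = new double[a[0].length][k];
        for (double[] row : omega)
            for (int j = 0; j < k; j++) row[j] = rnd.nextGaussian();
        double[][] q = multiply(a, omega);
        orthonormalizeColumns(q);
        return q;
    }

    /** Frobenius norm of A - Q Q^T A, i.e. how much of A the basis Q misses. */
    static double residual(double[][] a, double[][] q) {
        double[][] approx = multiply(q, multiply(transpose(q), a));
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++) {
                double d = a[i][j] - approx[i][j];
                sum += d * d;
            }
        return Math.sqrt(sum);
    }

    /** A small 6 x 5 matrix built as the sum of two outer products, so its rank is 2. */
    static double[][] exampleRankTwo() {
        double[][] a = new double[6][5];
        for (int i = 0; i < 6; i++)
            for (int j = 0; j < 5; j++)
                a[i][j] = (i + 1) * (j + 1) + ((i % 2 == 0) ? 1.0 : -1.0) * (j % 3);
        return a;
    }

    public static void main(String[] args) {
        double[][] a = exampleRankTwo();
        // Two random directions suffice to capture a rank-2 matrix; one does not.
        System.out.println("k=2 residual: " + residual(a, rangeBasis(a, 2, 42L)));
        System.out.println("k=1 residual: " + residual(a, rangeBasis(a, 1, 42L)));
    }
}
```

On large sparse inputs the product with the random matrix would be the distributed step; this in-memory version only shows the arithmetic.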

Re: [Fwd: Re: About Display Code]

2010-02-24 Thread Jake Mannix
why is this not showing up in the unit tests? On Wed, Feb 24, 2010 at 6:36 PM, Jeff Eastman jeast...@windwardsolutions.com wrote: AbstractVector.minus has a bug in the first if clause. Don't know if my fix or this one would do what is intended by the optimization:

Re: [Fwd: Re: About Display Code]

2010-02-24 Thread Jake Mannix
On Wed, Feb 24, 2010 at 6:43 PM, Jeff Eastman j...@windwardsolutions.com wrote: The unit test is subtracting a vector from itself and testing for zero :) Egads!
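
To make the point about the unit test concrete: a test that computes v - v and checks for zero passes for many broken subtraction routines. The sketch below uses plain double arrays rather than Mahout's Vector classes, and buggyMinus is a hypothetical swapped-operand bug chosen for illustration; the actual defect in AbstractVector.minus is not shown in this thread excerpt.

```java
import java.util.Arrays;

// Illustrative sketch only: plain arrays stand in for Mahout vectors,
// and buggyMinus is an invented bug, not the one discussed in the thread.
public class MinusTestSketch {

    /** Correct element-wise subtraction: a - b. */
    static double[] minus(double[] a, double[] b) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] - b[i];
        return r;
    }

    /** Hypothetical buggy version with the operands swapped: b - a. */
    static double[] buggyMinus(double[] a, double[] b) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) r[i] = b[i] - a[i];
        return r;
    }

    /** True when every element is zero. */
    static boolean isZero(double[] v) {
        for (double x : v) if (x != 0.0) return false;
        return true;
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0};
        double[] w = {0.5, 0.0, -4.0};
        // Self-subtraction yields zero even for the buggy routine, so it cannot catch the bug.
        System.out.println(isZero(buggyMinus(v, v)));
        // Distinct operands expose the difference between the two routines.
        System.out.println(Arrays.equals(buggyMinus(v, w), minus(v, w)));
    }
}
```

A test with two distinct operands and an explicitly written expected result would distinguish the two routines, which the self-subtraction check cannot.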

[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-24 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-301: --- Attachment: MAHOUT-301.patch Improve command-line shell script by allowing default properties files

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-24 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838159#action_12838159 ] Jake Mannix commented on MAHOUT-301: Ok, new patch, with the modification that indeed

[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-24 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-301: --- Fix Version/s: (was: 0.4) 0.3 Let's release this. Others want to try it out

[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2010-02-23 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837324#action_12837324 ] Jake Mannix commented on MAHOUT-180: Hi Danny, thanks for trying this out! You have

Re: 0.3 release issues

2010-02-23 Thread Jake Mannix
So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targeted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. It's

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837345#action_12837345 ] Jake Mannix commented on MAHOUT-301: Hey Drew, thanks for looking at this. Problems

Re: 0.3 release issues

2010-02-23 Thread Jake Mannix
...@gmail.com wrote: What about a new follow-on JIRA so 301 can stay in the official release notes? On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com wrote: So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targeted for 0.4, and I

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837351#action_12837351 ] Jake Mannix commented on MAHOUT-301: Ok, Drew, got your patch in diff mode against mine

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837428#action_12837428 ] Jake Mannix commented on MAHOUT-301: {quote} Something else I noticed

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837440#action_12837440 ] Jake Mannix commented on MAHOUT-301: {quote} Jake, the basic idea is that you would

[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837472#action_12837472 ] Jake Mannix commented on MAHOUT-301: {quote} Ahh, I see where you're coming from, so
