[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-24 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35926018 @mengxr I don't think we need to change the license now since it's optional... I think the LGPL is the least contentious of the possible licenses at play here :-

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-24 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35918773 @fommil Either AL2 or MPL should work. We only need appropriate labeling for MPL, which is trivial. And thanks for the suggestion of making native libraries optio

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-24 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35907015 this also means that the JNILoader license simply doesn't matter anymore, which saves me from having to issue a new release. --- If your project is set up for it

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-24 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35906687 @fommil fine by me. I'll get on it. On Feb 24, 2014 4:07 AM, "Sam Halliday" wrote: > Hi all, > > The discussions with ASF on the LEGAL ticket ha

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-24 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35879891 Hi all, The discussions with ASF on the LEGAL ticket have exposed some concerns - **unrelated to the LGPL** - that I think everybody needs to be aware

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35844934 Actually, if somebody creates a ticket for me on https://github.com/fommil/jniloader that's the best way to ensure that I'll actually update the license and relea

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35844613 @srowen hehe, oh, I know. Actually I'm more interested in knowing exactly *why* they don't like LGPL. There have been so many discussions in the past between FSF

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35842233 @srowen @fommil Breeze is flexible enough that we can swap out different back ends quickly (and let users decide at runtime). So if need be, I can do the work to ma

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35842122 @fommil ASF is silent on the MPL: http://www.apache.org/legal/resolved.html#category-a But Mozilla says it's compatible with AL2: http://www.mozilla.org/MPL/l

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35839024 @mengxr looking through all the Apache authorised licenses, it would appear that the Mozilla license is a better fit with my goals since it would require distribu

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-19 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35557645 @fommil Thanks a lot! The license JIRA is also interesting to follow ~ :)

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-19 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35548061 @srowen I've asked the question. I'm interested to see the response: https://issues.apache.org/jira/browse/LEGAL-192

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-19 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35547276 @srowen "The LGPL is ineligible primarily due to the restrictions it places on larger works, violating the third license criterion. Therefore, LGPL-licensed works

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-19 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35546981 @mengxr consider this message to be proof that jniloader is distributed under the Apache license. I'll update the build files next time I need a code change. If y

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35450646 @mengxr thanks for doing all this! It's nice to see that the overhead in Breeze is largely negligible as compared to MTJ (and maybe even slightly better sometimes

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35449886 @fommil @MLnick I included MTJ in the benchmarks (see the updated comment above). Basically it performs very similarly to Breeze. @martinjaggi Gradient ba

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread martinjaggi
Github user martinjaggi commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35448685 Thanks @mengxr for the benchmark efforts! Just not sure if you got my comment about part 2) in the benchmark, k-means: In my opinion this algorithm is not ve

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35447529 For what it's worth, I like the idea of using breeze, even though I know little about it. Mostly, I like the idea of using something consistent most of all, and f

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35446662 Actually, @non it's worth you casting an eye over this discussion as the primary author of spire

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35446562 @rxin I've not seen implicits to cause a problem, except in high frequency scenarios. I suspect Breeze might have suffered from auto boxing, brought on by implici

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35445845 @fommil maybe that changed in 2.10.x entirely given the addition of value classes, and maybe Breeze is very careful in its implicit usage, but often implicits in sc

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35444959 @MLnick I agree, Breeze and Spire are a solid foundation for numerics in Scala.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35444819 @mengxr that's using veclib. Re: implicits and breeze, I don't know why you think it's a problem. Implicits are a compile time feature and combined with v

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread MLnick
Github user MLnick commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35443064 @mengxr ok, that is interesting. I have always advocated for Breeze, but was told 6 months ago that it was a non-starter due to

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35442089 @fommil I have the native vecLib BLAS/LAPACK shipped with Mac OS X and OpenBLAS installed for testing. OpenBLAS is not on the search path. I deleted both and re-r

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread MLnick
Github user MLnick commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35438611 @mengxr at the risk of adding to your workload... I think (license issues aside since I suppose both Breeze and MTJ are affected

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35438198 @mengxr no MTJ in the benchmarks? ;-) Given that it has the most mature sparse library on the JVM, I am surprised that you omitted it.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35437895 @mengxr given that the JNILoader will likely be loading proprietary native implementations of BLAS/LAPACK, I consider it to be a moot point... but if the Apache f

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35437116 @mengxr `netlib-java` uses **your** system optimised natives (if they are installed). So, back at you, what do you have installed? See the main page for b

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35433131 Thanks all for the suggestions! @srowen @giyengar I updated the small benchmark suite to include commons-math3. It seems to me commons-math3 has couple d

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread martinjaggi
Github user martinjaggi commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35220431 @dlwh actually i think it's the same story in structured prediction (SGD or BCFW), immediate updates on the vector are usually faster for the local machine.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35220185 @martinjaggi I've often found that minibatching makes things converge much more quickly, since you get a nice variance reduction in the estimate of the gradie

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread martinjaggi
Github user martinjaggi commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35219684 @dlwh Thanks! This is of course a nice idea. Perhaps surprisingly (and good for us) such tricks seem not even necessary in the current state of the art algor

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/incubator-spark/pull/575#discussion_r9779496 These factory methods can probably just be called `dense`, `sparse`, etc.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35218872 @martinjaggi For how it's usually implemented, that's right. But you can quite likely get better performance doing minibatches with dense vector/CSC multiply

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread martinjaggi
Github user martinjaggi commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35218573 @fommil No matrix operations are performed at all so far, only vector addition (of type dense += sparse). See the code in this PR by @mengxr . Vector operati
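
As a minimal illustration of the `dense += sparse` update mentioned in the comment above, assuming a sparse vector is stored as parallel index and value arrays (a hypothetical layout, not necessarily the one used in the PR), the update only needs to touch the nonzero positions. The function name `add_sparse_into_dense` is made up for this sketch.

```python
def add_sparse_into_dense(dense, indices, values):
    """In-place dense += sparse.

    `dense` is a mutable list of floats; `indices` and `values` are parallel
    sequences holding the positions and values of the nonzero entries of the
    sparse vector (an assumed representation for this sketch).
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    for i, v in zip(indices, values):
        # Only the nonzero positions are visited, so the cost is
        # proportional to the number of nonzeros, not to len(dense).
        dense[i] += v
    return dense


weights = [1.0, 2.0, 3.0, 4.0]
add_sparse_into_dense(weights, [1, 3], [0.5, -4.0])
```

The cost scales with the number of nonzeros rather than the full dimension, which is the property the discussion relies on for gradient-style updates.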

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35218098 @martinjaggi I'm happy to advise on what the best sparse format would be for any particular problem that you're wanting to solve in spark. just let me know the ma

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread martinjaggi
Github user martinjaggi commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35217718 Hope you don't get me wrong, I was not at all proposing to fix a single scheme, neither for serialization, or for the choice of sparse library. I was just su

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35216981 @martinjaggi I believe you would be making a massive mistake by agreeing on a single serialisation scheme for sparse vectors, unless that format is independent of

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-16 Thread martinjaggi
Github user martinjaggi commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35212055 Really looking forward to having sparse vectors in MLlib soon, this is super important! And thanks for your efforts so far! Just a quick comment abou

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35141897 we use mahout-math in a scalding for similar purposes. see here: https://github.com/tresata/ganitha the main motivation for us was the sparse vec

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35132098 @fommil Yes, I mentioned the benchmark suite from Peter to @srowen in my previous comment, but it is designed for dense linear algebra. I put some of the code I u

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread mikiobraun
Github user mikiobraun commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35128801 @fommil apology accepted ;) and yeah, sorry about what I said at ICML, I was misinformed. I think we can call it even! ;)

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35126575 @MLnick btw, I recently added some new ARPACK and linked-sparse matrix structures to MTJ that should give you all some ideas. BTW, I also recently created a "fast

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35126301 @mikiobraun apologies for misquoting you: I have obviously inferred too much from your commendation of `netlib-java`'s recent updates, and when we discussed a pot

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35105330 Just to follow up on Breeze performance: in the latest snapshot, we are consistently faster than JBlas and Mahout in @mengxr's benchmarks. On Fri, F

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread mikiobraun
Github user mikiobraun commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35073967 @fommil @MLnick Hello, I'm the author of jblas, and I'd like to clarify what fommil said about netlib-java exceeding my original goals for JBLAS, because I do

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread MLnick
Github user MLnick commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35070192 I for one would prefer to use netlib-java, whether it is via MTJ or Breeze. I've used both and find the MTJ API pretty good and with good sparse support (th

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35069073 @MLnick I have it on good authority (from the author of JBLAS) that he considers `netlib-java` to exceed his original goals for JBLAS. I am utterly confused why co

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread MLnick
Github user MLnick commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35055610 Originally sent this to dev list not github - the autopsy emails are a bit confusing on mobile :) @fommil @mengxr I think it's always worth

Re: [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread Nick Pentreath
@fommil @mengxr I think it's always worth a shot at a license change. Scikit learn devs have been successful before in getting such things over the line. Assuming we can make that happen, what do folks think about MTJ vs Breeze vs JBLAS + commons-math since these seem like the viable alternativ

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35038739 @fommil I don't quite understand what "roll their own" means exactly here. I didn't propose to re-implement one or half linear algebra library in the PR. For the

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35036443 @mengxr and someone decided to roll their own instead of talking to me, why? I've had previous discussions with Apache about the MTJ license and I said we could p

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35030774 @fommil MTJ uses LGPL. See http://www.apache.org/legal/resolved.html

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35021161 @fommil :-) Sorry to undersell. Breeze also has CSCMatrix support, but that's not entirely finished. On Thu, Feb 13, 2014 at 12:14 PM

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread fommil
Github user fommil commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35020661 @dlwh tracks? I thought I conclusively showed that there was zero impact to go native ;-) BTW, MTJ has sparse matrix support... and I also maintain it. What probl

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread dlwh
Github user dlwh commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35020275 I can cut a release this weekend. We wrap @fommil's netlib-java ( https://github.com/fommil/netlib-java), whose performance tracks with C pretty well. jblas i

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35018212 @MLnick MTJ is not an option because of its license.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35017848 @shivaram @srowen @giyengar Thanks for keeping the discussion running! @shivaram The requirement is to add sparse data support in all existing MLlib algor

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread MLnick
Github user MLnick commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35015689 I guess we should discuss on the JIRA ideally, but most of the discussion seems to be here. So I'll comment here. I've chimed in before in the original PR

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35007797 @srowen Thanks for the summary. For the external API I wouldn't mind using (i, j, value) -- It results in larger files and a `groupBy` to get to a row
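
To make the trade-off in the comment above concrete, here is a small sketch of turning `(i, j, value)` triples into per-row sparse entries, which is the grouping step the comment refers to. It runs on a plain in-memory list rather than a Spark RDD, and the function name `rows_from_triples` is hypothetical.

```python
from collections import defaultdict


def rows_from_triples(triples):
    """Group (row, column, value) triples into one sparse row per row index.

    Returns a dict mapping each row index to a list of (column, value)
    pairs sorted by column. This mirrors, on a local list, the kind of
    group-by-row step needed when the external format is one triple per line.
    """
    rows = defaultdict(list)
    for i, j, value in triples:
        rows[i].append((j, value))
    return {i: sorted(entries) for i, entries in rows.items()}


triples = [(0, 2, 1.5), (1, 0, 2.0), (0, 0, 3.0)]
rows = rows_from_triples(triples)
```

Each triple repeats its row index, which is where the larger file size comes from, and the grouping pass is the extra work needed before row-oriented algorithms can run.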

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread giyengar
Github user giyengar commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34988403 @shivaram, @srowen: I am a new user of Spark and MLLib in particular. I love the clean interface that MLLib currently has. It is a joy to code machine l

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-12 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34954740 My $0.02 to the discussion: 1. Within whatever operations mllib provides, serialization can be considered an implementation detail. But external serializa

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-12 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34948280 Sorry I missed this thread, but I'd like to understand a bit more about the scope of what we require in terms of library support before taking a decision.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34715135 Wow nice writeup. (Is Breeze benchmarked too somewhere? don't see it there). Totally agree. That's why I would use JBlas at least for the complex operations. Alth

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34714242 @srowen Thanks for the information! I believe native BLAS/LAPACK libraries perform much better than Java implementations for level 2 and level 3 operations, but f

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34712992 @debasish83 Are you speaking of the benchmark I posted to the JIRA? BLAS/LAPACK cannot be used for dense vector + sparse vector. Those are designed for dense-only

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34711528 I see the other discussion -- https://github.com/mesos/spark/pull/736 ? I didn't see the benchmark but maybe missed it. I think there was an impression th

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread debasish83
Github user debasish83 commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34710100 @mengxr as long as the interface is clean and we can bring in netlib-java, starting with mahout-math does not seem like a bad idea...netlib-java uses jni while i

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34707127 @sscdotopen @debasish83, I'm okay with copying VectorWritable and removing mahout-core from the dependencies. @srowen Just as you mentioned, the sparse vector

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34692729 The mahout-math implementation of vectors is encumbered with a few bad design choices, Hadoop stuff that's not needed here, dependence on that old fork of colt co

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread debasish83
Github user debasish83 commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34682539 I agree...depending on mahout-math is much better than bringing in the mahout-core...mahout-math code I think will compile fine with Apache Hadoop, CDH and HD

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread sscdotopen
Github user sscdotopen commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34674441 I think making the heavyweight mahout-core a dependency just for access to the sparse vectors is not a good idea. A better way would be to just depend on

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34662959 Merged build finished.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34662960 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12668/

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34659925 Merged build started.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34659923 Merged build triggered.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/575 [Proposal] Adding sparse data support and update KMeans This is a proposal for sparse data support in mllib (https://spark-project.atlassian.net/browse/MLLIB-18). The idea of the p