GitHub user johnc1231 opened a pull request:
https://github.com/apache/spark/pull/17459
[SPARK-20109][MLlib] Added toBlockMatrixDense to IndexedRowMatrix
## What changes were proposed in this pull request?
-I added the method `toBlockMatrixDense` to the IndexedRowMatrix
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya I made changes exactly as you did in your prototype, plus a few
style edits. But yeah, I think this is a good, easy to use implementation that
will be better in all use cases than current
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
Alright, I agree with this. We'll switch off Dense or Sparse matrix
backings based on what the type of the first vector in the iterator is. I'd be
happy to take on making these adjustments
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya I think we definitely care about giving users the ability to make
either dense or sparse Block matrices. I made a 100k by 10k IndexedRowMatrix of
random doubles, then converted
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
I think your prototype looks good. I'm pretty much just gonna do exactly
that then.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya Do you have any more comments on this, or are you happy with it?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r109417523
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -113,6 +114,67 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
Thanks for the feedback guys. All comments addressed, though if anyone else
has feedback on discussion me and @viirya are having about whether there should
be a separate `toBlockMatrixDense
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
Did some thinking about this, and I think that to make the API cleaner
maybe we could deprecate the regular `toBlockMatrix` method and add
`toBlockMatrixSparse`. Until it's removed, we could just
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r109320386
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala
---
@@ -89,11 +89,42 @@ class IndexedRowMatrixSuite
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r109320642
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -113,6 +114,67 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
Addressed comments where everything was clear, replied to the last one
about only having one toBlockMatrix. Back to you @viirya . Thanks for feedback.
---
If your project is set up for it, you
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r109320701
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala
---
@@ -89,11 +89,42 @@ class IndexedRowMatrixSuite
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r109320468
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -113,6 +114,67 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r110163778
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala
---
@@ -89,11 +89,19 @@ class IndexedRowMatrixSuite
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya Added appropriate tests, back to you.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r111302756
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala
---
@@ -87,21 +87,74 @@ class IndexedRowMatrixSuite
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r111302391
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -108,8 +108,64 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r111492414
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala
---
@@ -87,19 +87,92 @@ class IndexedRowMatrixSuite
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya
Gonna fix that last test tonight, but just doing a quick timing on my
laptop, (2014 Macbook Pro with 2.2Hz i7), the difference was as follows on a
10k by 10k matrix of random doubles
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya I fixed the test as you asked, so please take a look when you get a
chance. I'm having a little bit of trouble with my local spark build for some
reason, but I'll do that other benchmark
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya Any more feedback on this?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
Also, made changes suggested by @srowen . Don't know if he now has to sign
off on those changes being done.
---
If your project is set up for it, you can reply to this email and have your
reply
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya Now that style nitpicks and sparse benchmarks are done, are you
good with this? Also, per your recommendation, CCing @MLnick and @jkbradley for
review of this. Should be easy to review
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
This has been reviewed pretty thoroughly at this point. Can a committer
give this a quick look? @srowen @MLnick @jkbradley I think it's basically ready
to go in.
---
If your project is set up
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r118793567
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala
---
@@ -87,19 +87,96 @@ class IndexedRowMatrixSuite
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r118793614
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -108,8 +108,69 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@srowen Addressed comments, back to you. And thanks for taking time to look
this over.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r118793854
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -108,8 +108,69 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@srowen Fixed both, back to you
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r117620801
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -108,8 +108,64 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r117621336
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -108,8 +108,64 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@srowen @viirya All comments addressed, back to you guys. Hopefully we've
just about reached something ready to commit.
---
If your project is set up for it, you can reply to this email
Github user johnc1231 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17459#discussion_r119463845
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
---
@@ -108,8 +108,69 @@ class IndexedRowMatrix @Since
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
@viirya Addressed style nitpicks and did spark benchmarks. Think that
should be everything.
---
If your project is set up for it, you can reply to this email and have your
reply appear
Github user johnc1231 commented on the issue:
https://github.com/apache/spark/pull/17459
Did a sparse benchmark (2014 Macbook Pro with 2.2Hz i7) 60 partitions, 10k
by 10k matrix with mostly 0's, 10% 1's, made of SparseVectors. Both old method
and new method took about 7.5 seconds
36 matches
Mail list logo