[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882179#comment-15882179 ]

Nick Pentreath commented on SPARK-3434:

This JIRA only has SPARK-3976 open. There was an old PR for it by [~brkyvz] here: https://github.com/apache/spark/pull/4286 (which was abandoned). Unless SPARK-3976 is a priority and someone wants to revive it, shall we close this JIRA since the rest of the tickets are resolved?

> Distributed block matrix
>
>                 Key: SPARK-3434
>                 URL: https://issues.apache.org/jira/browse/SPARK-3434
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Shivaram Venkataraman
>
> This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs, each containing only one sub-matrix?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175557#comment-14175557 ]

Reza Zadeh commented on SPARK-3434:

Thanks Shivaram! As discussed over the phone, we will use your design and build upon it, so that you can focus on the linear algebraic operations such as TSQR.
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171272#comment-14171272 ]

Shivaram Venkataraman commented on SPARK-3434:

Sorry for the delay in getting back -- I've posted a design doc at http://goo.gl/0eE5fh and a reference implementation at https://github.com/amplab/ml-matrix. The design doc and the reference implementation use Spark as a library -- so this works as a standalone library in case somebody wants to try it out.

Some more points to note regarding the integration:
1. The existing implementation uses breeze matrices in the interface, but we will change that to use the local Matrix trait already present in Spark.
2. The matrix layouts will also extend the DistributedMatrix class in MLlib, and we could create a new interface BlockDistributedMatrix from the interface in amplab/ml-matrix.
3. We can also use this JIRA or create a new JIRA to discuss what algorithms / operations should be merged into Spark. I think TSQR and NormalEquations should be pretty useful. Other algorithms like 2-D BlockQR and BlockCoordinateDescent can be merged later if we feel they are useful (these haven't been pushed to ml-matrix yet).

I will create a first patch for the matrix formats in a couple of days. Please let me know if there are any questions / clarifications.
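One reason a normal-equations solver fits a row-partitioned layout is that A^T A and A^T b are sums of per-block contributions, so each block of rows can be processed independently and the small results added together. The following is a minimal sketch of that idea in plain Python, not the ml-matrix code: the function names (gram, at_b, solve, normal_equations) are hypothetical, the blocks are ordinary lists rather than RDD partitions, and the small system is solved exactly with Fractions on the assumption that A^T A is nonsingular.

```python
from fractions import Fraction


def gram(block):
    """Return A_i^T A_i for one block of rows (hypothetical helper)."""
    n = len(block[0])
    return [[sum(row[p] * row[q] for row in block) for q in range(n)]
            for p in range(n)]


def at_b(block, rhs):
    """Return A_i^T b_i for one block of rows and its right-hand side."""
    n = len(block[0])
    return [sum(row[p] * y for row, y in zip(block, rhs)) for p in range(n)]


def solve(m, v):
    """Gauss-Jordan elimination on exact Fractions; assumes m is nonsingular."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(m, v)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def normal_equations(row_blocks, rhs_blocks):
    """Solve min ||Ax - b|| by summing per-block A_i^T A_i and A_i^T b_i."""
    n = len(row_blocks[0][0])
    ata = [[0] * n for _ in range(n)]
    atb = [0] * n
    for block, rhs in zip(row_blocks, rhs_blocks):
        g, h = gram(block), at_b(block, rhs)
        ata = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(ata, g)]
        atb = [x + y for x, y in zip(atb, h)]
    return solve(ata, atb)


# A tall matrix stored as two blocks of rows, with b = A * [2, 3].
blocks = [[[1, 0], [0, 1]], [[1, 1], [1, 2]]]
rhs = [[2, 3], [5, 8]]
solution = normal_equations(blocks, rhs)
```

In a distributed setting the loop body would run per partition and the additions would become a reduce; only the n-by-n system is solved on one machine.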
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167152#comment-14167152 ]

Burak Yavuz commented on SPARK-3434:

[~ConcreteVitamin], any updates? Anything I can help out with?
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167478#comment-14167478 ]

Shivaram Venkataraman commented on SPARK-3434:

[~brkyvz] -- We are just adding a few more test cases to the classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here.
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161762#comment-14161762 ]

Ghousia Taj commented on SPARK-3434:

Hi there,

We at Impetus Infotech are also working on a block matrix implementation. To start with, we are considering a matrix as an RDD of blocks. Each block includes a submatrix, a row range, and a column range. The work discussed in this JIRA connects closely with the work we are currently doing on matrix operations. Your reference implementation would really help us progress faster. We are also looking forward to working in tandem with you all and contributing in this space.

Many thanks,
Ghousia.
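The representation described in the comment above (a matrix as a collection of blocks, each carrying a submatrix, a row range, and a column range) can be sketched as follows. This is only an illustration under stated assumptions: the Block class and the split_into_blocks / assemble helpers are hypothetical names, ranges are taken to be half-open, and a plain Python list stands in for the RDD.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One sub-matrix plus the half-open row and column ranges it covers."""
    row_range: Tuple[int, int]
    col_range: Tuple[int, int]
    values: List[List[float]]  # dense, row-major


def split_into_blocks(matrix, rows_per_block, cols_per_block):
    """Cut a dense matrix into blocks; edge blocks may be smaller."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    blocks = []
    for r0 in range(0, n_rows, rows_per_block):
        r1 = min(r0 + rows_per_block, n_rows)
        for c0 in range(0, n_cols, cols_per_block):
            c1 = min(c0 + cols_per_block, n_cols)
            sub = [row[c0:c1] for row in matrix[r0:r1]]
            blocks.append(Block((r0, r1), (c0, c1), sub))
    return blocks


def assemble(blocks, n_rows, n_cols):
    """Rebuild the dense matrix by writing each block at its offsets."""
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for block in blocks:
        r0 = block.row_range[0]
        c0 = block.col_range[0]
        for i, row in enumerate(block.values):
            for j, value in enumerate(row):
                out[r0 + i][c0 + j] = value
    return out


matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
blocks = split_into_blocks(matrix, rows_per_block=2, cols_per_block=2)
```

Because every block records its own ranges, the blocks can live in any order inside one collection, which is what makes a single-RDD layout workable.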
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162156#comment-14162156 ]

Xiangrui Meng commented on SPARK-3434:

[~shivaram] and [~ConcreteVitamin] Any updates on the design doc and prototype?
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158949#comment-14158949 ]

Reza Zadeh commented on SPARK-3434:

Any updates, Shivaram?
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153774#comment-14153774 ]

Shivaram Venkataraman commented on SPARK-3434:

I'll post a design doc sometime tonight. We also have a reference implementation that I will add a link to, and we can base our discussion off that.
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152275#comment-14152275 ]

Reza Zadeh commented on SPARK-3434:

It looks like Shivaram Venkataraman from the AMPLab has started work on this. I will be meeting with him to see if we can reuse some of his work.
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152636#comment-14152636 ]

Xiangrui Meng commented on SPARK-3434:

[~shivaram] Could you post the design of the partitioning strategy for block matrices? I think we should have a 2D partitioner, which consists of a row partitioner and a column partitioner. A matrix with partitioner (p1, p2) can multiply a matrix with partitioner (p2, p3), resulting in a matrix with partitioner (p1, p3).
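The (p1, p2) x (p2, p3) -> (p1, p3) rule in the comment above can be sketched as a small compatibility check. This is an illustration only, not a Spark API: RangePartitioner1D, Partitioner2D, and product_partitioner are hypothetical names, and the 1D partitioner is assumed to split an index range into contiguous, equally sized chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangePartitioner1D:
    """Splits indices 0..size-1 into contiguous chunks of `chunk` indices."""
    size: int
    chunk: int

    @property
    def num_parts(self):
        return (self.size + self.chunk - 1) // self.chunk

    def part_of(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index // self.chunk


@dataclass(frozen=True)
class Partitioner2D:
    """A row partitioner paired with a column partitioner."""
    rows: RangePartitioner1D
    cols: RangePartitioner1D

    def block_of(self, i, j):
        """Block coordinates of matrix entry (i, j)."""
        return (self.rows.part_of(i), self.cols.part_of(j))

    def partition_id(self, i, j):
        """Flatten the block coordinates into one id, row-major."""
        bi, bj = self.block_of(i, j)
        return bi * self.cols.num_parts + bj


def product_partitioner(a, b):
    """Partitioner of A*B; requires A's column split to equal B's row split."""
    if a.cols != b.rows:
        raise ValueError("inner partitioners differ; repartition one operand")
    return Partitioner2D(a.rows, b.cols)


p1 = RangePartitioner1D(size=4, chunk=2)
p2 = RangePartitioner1D(size=6, chunk=3)
p3 = RangePartitioner1D(size=2, chunk=1)
a_part = Partitioner2D(p1, p2)  # a 4 x 6 matrix
b_part = Partitioner2D(p2, p3)  # a 6 x 2 matrix
c_part = product_partitioner(a_part, b_part)
```

When the inner partitioners match, block (i, k) of the left operand lines up exactly with block (k, j) of the right operand, so the product can be formed block by block without re-cutting either matrix.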
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140160#comment-14140160 ]

Gaurav Mishra commented on SPARK-3434:

Representing a matrix by multiple RDDs of sub-matrices may be helpful when an operation on the matrix requires computation over only a small set of its sub-matrices. However, operations like matrix multiplication require computation over all elements in the matrix (i.e. all elements need to be read). Therefore, at least in the case of matrix multiplication, keeping a single RDD seems to be a better idea. Keeping multiple RDDs in that case will only burden us further with the task of keeping track of all the sub-matrices.
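The point about multiplication touching every block can be seen in a small sketch. It assumes each matrix is held as one collection of blocks keyed by (block row, block column); a Python dict stands in for the single keyed RDD, and the function names (matmul_dense, add_dense, block_multiply) are hypothetical.

```python
from collections import defaultdict


def matmul_dense(x, y):
    """Plain dense product of two small row-major matrices."""
    inner = len(y)
    return [[sum(x[i][k] * y[k][j] for k in range(inner))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def add_dense(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def block_multiply(a_blocks, b_blocks):
    """C[i, j] = sum over k of A[i, k] * B[k, j], on keyed blocks."""
    # Group B's blocks by block-row index k so they can be joined with A.
    b_by_row = defaultdict(list)
    for (k, j), block in b_blocks.items():
        b_by_row[k].append((j, block))

    result = {}
    for (i, k), a_block in a_blocks.items():  # every block of A is read
        for j, b_block in b_by_row[k]:        # and paired with a row of B
            partial = matmul_dense(a_block, b_block)
            key = (i, j)
            if key in result:
                result[key] = add_dense(result[key], partial)
            else:
                result[key] = partial
    return result


# [[1, 2], [3, 4]] times [[5, 6], [7, 8]], each cut into 1 x 1 blocks.
a_blocks = {(0, 0): [[1]], (0, 1): [[2]], (1, 0): [[3]], (1, 1): [[4]]}
b_blocks = {(0, 0): [[5]], (0, 1): [[6]], (1, 0): [[7]], (1, 1): [[8]]}
c_blocks = block_multiply(a_blocks, b_blocks)
```

With one keyed collection per matrix, the grouping and pairing steps map onto a join on k followed by a reduce on (i, j); with one RDD per sub-matrix the same bookkeeping would have to be done by hand.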