[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529567#comment-16529567 ]

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:41 AM:
--

[~tygert] Can you take a look at the following PR
https://github.com/apache/spark/pull/21695

was (Author: nagpall):
[~tygert] Can you take a look at the following PR [PR|[https://github.com/apache/spark/pull/21695|http://example.com/]]

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---
>
>                 Key: SPARK-8614
>                 URL: https://issues.apache.org/jira/browse/SPARK-8614
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: Jan Luts
>            Priority: Major
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are
> dropped before calling the methods from RowMatrix. For example for
> IndexedRowMatrix.computeSVD:
>    val svd = toRowMatrix().computeSVD(k, computeU, rCond)
> and for IndexedRowMatrix.multiply:
>    val mat = toRowMatrix().multiply(B).
> After computing these results, they are zipped with the original indices,
> e.g. for IndexedRowMatrix.computeSVD
>    val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>      IndexedRow(i, v)
>    }
> and for IndexedRowMatrix.multiply:
>
>    val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>      IndexedRow(i, v)
>    }
> I have experienced that for IndexedRowMatrix.computeSVD().U and
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply) row
> indices can get mixed (when running Spark jobs with multiple
> executors/machines): i.e. the vectors and indices of the result do not seem
> to correspond anymore.
> To me it looks like this is caused by zipping RDDs that have a different
> ordering?
> For the IndexedRowMatrix.multiply I have observed that ordering within
> partitions is preserved, but that it seems to get mixed up between
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions in RowMatrix.multiply:
>    val AB = rows.mapPartitions { iter =>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no
> longer there.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
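
The mis-pairing described in the issue can be illustrated without a cluster. The sketch below is plain Scala, not Spark code: the `IndexedRow` case class, the `transform` function (a stand-in for multiplying a row by a matrix) and the reversed partition order are all assumptions made for illustration. It does not claim to reproduce the actual cause inside Spark; it only shows why re-attaching indices by position is fragile when the two sequences are enumerated in different partition orders, and why carrying the index alongside each row avoids the problem.

```scala
object ZipOrderDemo {
  // Stand-in for MLlib's IndexedRow (hypothetical, plain Scala collections only).
  final case class IndexedRow(index: Long, vector: Vector[Double])

  // Hypothetical per-row computation standing in for "row times matrix B".
  def transform(v: Vector[Double]): Vector[Double] = v.map(_ * 2.0)

  // Two simulated partitions of an indexed matrix.
  val partitions: Seq[Seq[IndexedRow]] = Seq(
    Seq(IndexedRow(0L, Vector(1.0)), IndexedRow(1L, Vector(2.0))),
    Seq(IndexedRow(2L, Vector(3.0)), IndexedRow(3L, Vector(4.0)))
  )

  // Fragile: drop the indices, compute, then re-attach them by position.
  // The results are enumerated with the partitions in a different order
  // (simulated here with reverse), so indices and vectors no longer match.
  def zipByPosition: Seq[IndexedRow] = {
    val indices = partitions.flatten.map(_.index)
    val results = partitions.reverse.flatten.map(r => transform(r.vector))
    indices.zip(results).map { case (i, v) => IndexedRow(i, v) }
  }

  // Robust: keep each index attached to its row during the computation,
  // so the enumeration order of the partitions does not matter.
  def carryIndex: Seq[IndexedRow] =
    partitions.reverse.flatten.map(r => IndexedRow(r.index, transform(r.vector)))

  def main(args: Array[String]): Unit = {
    println(zipByPosition)               // index 0 is paired with a vector from partition 2
    println(carryIndex.sortBy(_.index))  // every index keeps its own vector
  }
}
```

Within each simulated partition the order is preserved in both variants, which matches the observation in the report that only the order between partitions appeared to change.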
[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529567#comment-16529567 ]

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:40 AM:
--

[~tygert] Can you take a look at the following PR [https://github.com/apache/spark/pull/21695|http://example.com/]

was (Author: nagpall):
[~tygert] Can you take a look at the following PR [https://github.com/apache/spark/pull/21695|http://example.com]
[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529567#comment-16529567 ]

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:40 AM:
--

[~tygert] Can you take a look at the following PR [PR|[https://github.com/apache/spark/pull/21695|http://example.com/]]

was (Author: nagpall):
[~tygert] Can you take a look at the following PR [https://github.com/apache/spark/pull/21695|http://example.com/]
[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645652#comment-15645652 ]

Mark Tygert edited comment on SPARK-8614 at 11/7/16 10:36 PM:
--

This remains a big issue: it renders the results produced by MLlib incorrect for most matrix decompositions and matrix-matrix multiplications when using multiple executors or workers. [~hl475] of Yale is working to fix the problem, and eventually ML for DataFrames will need to incorporate his solutions.

was (Author: tygert):
This remains a big issue, rendering the results produced by MLlib to be incorrect for most matrix decompositions and matrix-matrix multiplications when using multiple executors or workers. Huamin Li of Yale is working to fix the problem, and eventually ML for DataFrames will need to incorporate his solutions.