[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2018-07-02 Thread Anuj Nagpall (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529567#comment-16529567
 ] 

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:41 AM:
--

[~tygert] Can you take a look at the following PR 
https://github.com/apache/spark/pull/21695


was (Author: nagpall):
[~tygert] Can you take a look at the following PR 

[PR|[https://github.com/apache/spark/pull/21695|http://example.com/]]

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---
>
> Key: SPARK-8614
> URL: https://issues.apache.org/jira/browse/SPARK-8614
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Jan Luts
>Priority: Major
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply, the indices
> are dropped before calling the methods from RowMatrix. For example, for
> IndexedRowMatrix.computeSVD:
>    val svd = toRowMatrix().computeSVD(k, computeU, rCond)
> and for IndexedRowMatrix.multiply:
>    val mat = toRowMatrix().multiply(B)
> After computing these results, they are zipped with the original indices,
> e.g. for IndexedRowMatrix.computeSVD:
>    val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>      IndexedRow(i, v)
>    }
> and for IndexedRowMatrix.multiply:
>    val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>      IndexedRow(i, v)
>    }
> I have observed that for IndexedRowMatrix.computeSVD().U and
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply) row
> indices can get mixed up when running Spark jobs with multiple
> executors/machines, i.e. the vectors and indices of the result no longer
> seem to correspond.
> To me it looks like this is caused by zipping RDDs that have different
> orderings.
> For IndexedRowMatrix.multiply I have observed that the ordering within
> partitions is preserved, but that it seems to get mixed up between
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions in RowMatrix.multiply:
>    val AB = rows.mapPartitions { iter =>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no
> longer there.
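
The mismatch described above comes from dropping the indices and zipping them back on afterwards. A minimal sketch of one way to sidestep the re-zip is to keep each index attached to its row while the row is transformed. The code below is plain Scala, not Spark: `IndexedRow` here is a hypothetical stand-in for the MLlib class, partitions are modelled as nested sequences, and `KeepIndexSketch`, `multiplyRow` and `multiply` are illustrative names. It is not necessarily what the linked PR does.

```scala
// Hypothetical stand-in for MLlib's IndexedRow; not the Spark class itself.
final case class IndexedRow(index: Long, vector: Vector[Double])

object KeepIndexSketch {
  // Multiply one row vector of length n by an n x m matrix stored as rows.
  def multiplyRow(row: Vector[Double], b: Vector[Vector[Double]]): Vector[Double] = {
    require(b.nonEmpty && row.length == b.length, "dimension mismatch")
    val m = b.head.length
    Vector.tabulate(m) { j => row.indices.map(i => row(i) * b(i)(j)).sum }
  }

  // Partitions are modelled as nested sequences. The index travels with its
  // row, so the result does not depend on the order in which partitions arrive.
  def multiply(partitions: Seq[Seq[IndexedRow]],
               b: Vector[Vector[Double]]): Seq[Seq[IndexedRow]] =
    partitions.map(_.map(r => IndexedRow(r.index, multiplyRow(r.vector, b))))
}
```

Because every output row is built from the input row that carries its index, reordering the partitions changes only the order of the output, never which vector belongs to which index. In Spark terms the analogous idea would be a single `map` over the indexed rows instead of `toRowMatrix()` followed by `zip`.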



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2018-07-02 Thread Anuj Nagpall (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529567#comment-16529567
 ] 

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:40 AM:
--

[~tygert] Can you take a look at the following PR 
[https://github.com/apache/spark/pull/21695|http://example.com/]



was (Author: nagpall):
[~tygert] Can you take a look at the following PR 
[https://github.com/apache/spark/pull/21695|http://example.com]


[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2018-07-02 Thread Anuj Nagpall (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529567#comment-16529567
 ] 

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:40 AM:
--

[~tygert] Can you take a look at the following PR 

[PR|[https://github.com/apache/spark/pull/21695|http://example.com/]]


was (Author: nagpall):
[~tygert] Can you take a look at the following PR 
[https://github.com/apache/spark/pull/21695|http://example.com/]



[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2016-11-07 Thread Mark Tygert (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645652#comment-15645652
 ] 

Mark Tygert edited comment on SPARK-8614 at 11/7/16 10:36 PM:
--

This remains a big issue: it renders the results produced by MLlib incorrect 
for most matrix decompositions and matrix-matrix multiplications when using 
multiple executors or workers. [~hl475] of Yale is working to fix the problem, 
and eventually ML for DataFrames will need to incorporate his solutions.


was (Author: tygert):
This remains a big issue: it renders the results produced by MLlib incorrect 
for most matrix decompositions and matrix-matrix multiplications when using 
multiple executors or workers. Huamin Li of Yale is working to fix the problem, 
and eventually ML for DataFrames will need to incorporate his solutions.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)