Anuj Nagpall created SPARK-24693:
------------------------------------

             Summary: Row order preservation for operations on MLlib IndexedRowMatrix
                 Key: SPARK-24693
                 URL: https://issues.apache.org/jira/browse/SPARK-24693
             Project: Spark
          Issue Type: Bug
          Components: MLlib
            Reporter: Anuj Nagpall
In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply, the row indices are dropped before delegating to the corresponding RowMatrix methods. For example, IndexedRowMatrix.computeSVD calls:

    val svd = toRowMatrix().computeSVD(k, computeU, rCond)

and IndexedRowMatrix.multiply calls:

    val mat = toRowMatrix().multiply(B)

The results are then zipped back with the original indices, e.g. in IndexedRowMatrix.computeSVD:

    val indexedRows = indices.zip(svd.U.rows).map { case (i, v) => IndexedRow(i, v) }

and in IndexedRowMatrix.multiply:

    val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) => IndexedRow(i, v) }

I have observed that for IndexedRowMatrix.computeSVD(...).U and for IndexedRowMatrix.multiply(...) (both of which depend on RowMatrix.multiply), the row indices can get mixed up when running Spark jobs with multiple executors/machines: the vectors and indices of the result no longer correspond. This appears to be caused by zipping two RDDs whose rows are in a different order.

For IndexedRowMatrix.multiply I have observed that the ordering within partitions is preserved, but that the ordering between partitions gets mixed up. For example, for:

    part1Index1 part1Vector1
    part1Index2 part1Vector2
    part2Index1 part2Vector1
    part2Index2 part2Vector2

I got:

    part2Index1 part1Vector1
    part2Index2 part1Vector2
    part1Index1 part2Vector1
    part1Index2 part2Vector2

Another observation: the mapPartitions call in RowMatrix.multiply:

    val AB = rows.mapPartitions { iter =>

had a "preservesPartitioning = true" argument in version 1.0, but this argument is no longer present.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
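To illustrate the shape of a fix, here is a minimal local Scala sketch (plain Scala, no Spark; the case class and method name are hypothetical, not the actual MLlib code). The idea is to carry each row's index through the multiplication itself, rather than dropping the indices and zipping the result RDD back with them afterwards, so the index/vector correspondence cannot depend on RDD ordering:

```scala
// Hypothetical simplified stand-in for MLlib's IndexedRow (index + dense vector).
case class IndexedRow(index: Long, vector: Array[Double])

// Multiply each indexed row (treated as a 1 x n row vector) by a local
// n x p matrix B, keeping the row's index attached to its result.
// Because the index travels inside the same record as the vector,
// no post-hoc zip is needed to restore the correspondence.
def multiplyKeepingIndex(rows: Seq[IndexedRow],
                         b: Array[Array[Double]]): Seq[IndexedRow] =
  rows.map { case IndexedRow(i, v) =>
    val result = b.head.indices.map { j =>           // for each column j of B
      v.indices.map(k => v(k) * b(k)(j)).sum         // dot product v . B(:, j)
    }.toArray
    IndexedRow(i, result)
  }
```

In the distributed setting the same approach would mean mapping over the indexed rows directly (keeping the index inside the closure), instead of relying on RDD.zip preserving a matching order across two separately computed RDDs.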