Ohad Raviv created SPARK-16820:
----------------------------------

             Summary: Sparse - Sparse matrix multiplication
                 Key: SPARK-16820
                 URL: https://issues.apache.org/jira/browse/SPARK-16820
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.0.0
            Reporter: Ohad Raviv


While working on an MCL implementation on Spark we encountered some
difficulties.
The main part of the process is a distributed sparse matrix multiplication
with two main steps:
1.      Simulate multiply – a preparation step before the real multiplication
that determines which blocks should be multiplied.
2.      The actual block multiplication and summation.
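To make step 1 concrete, here is a hedged sketch (plain Python, not Spark's actual code) of the "simulate multiply" idea: given only the coordinates of the non-empty blocks of A and B, compute which block pairs must be multiplied and which destination block each product feeds. The function name and representation are illustrative assumptions.

```python
def simulate_multiply(a_blocks, b_blocks):
    """a_blocks, b_blocks: sets of (row, col) coordinates of non-empty blocks.
    Returns a dict mapping each destination block (i, j) to the list of
    shared indices k for which A[i, k] * B[k, j] contributes to it."""
    # index B's blocks by their block-row so each A block finds its partners
    b_by_row = {}
    for (k, j) in b_blocks:
        b_by_row.setdefault(k, []).append(j)

    products = {}
    for (i, k) in a_blocks:
        for j in b_by_row.get(k, []):
            products.setdefault((i, j), []).append(k)
    return products

# Example: a 2x2 grid of blocks where only two blocks per matrix are non-empty.
a = {(0, 0), (1, 1)}
b = {(0, 1), (1, 0)}
print(simulate_multiply(a, b))  # {(0, 1): [0], (1, 0): [1]} (key order may vary)
```

Only the block pairs listed in the result need to be shuffled and multiplied; everything else can be skipped.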

In our case the sparse matrix has 50M rows and columns, and 2B non-zeros.

The current multiplication suffers from these issues:

1.      A relatively trivial bug in the first step, already fixed in
[SPARK-16469], that caused the process to be very slow.
2.      Even after that fix, with too many blocks the simulate multiply takes
a very long time and replicates the data many times (O(n^3), where n is the
number of blocks per dimension).
3.      Spark supports multiplication only with dense matrices, so it
converts each sparse matrix into a dense matrix before multiplying.
4.      For summing the intermediate block results Spark uses Breeze's CSC
matrix operations – the problem is that updating a zero entry of a CSC
matrix is very inefficient.
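The CSC issue in point 4 can be sidestepped by accumulating block sums in a hash map keyed by (row, col) and building the compressed form once at the end. A minimal sketch of that alternative (plain Python, not Breeze – the function names are illustrative): inserting a new non-zero into CSC shifts its value/row-index arrays, so repeated inserts cost O(nnz) each, while a hash-map update is O(1) amortized.

```python
def add_blocks(blocks):
    """blocks: iterable of sparse blocks, each a (row, col) -> value dict
    for one intermediate block product; returns their element-wise sum."""
    acc = {}
    for block in blocks:
        for coord, v in block.items():
            acc[coord] = acc.get(coord, 0.0) + v
    return acc

def to_csc(entries, n_cols):
    """One-shot conversion of the accumulated dict to CSC arrays
    (column pointers, row indices, values)."""
    by_col = sorted(entries.items(), key=lambda kv: (kv[0][1], kv[0][0]))
    col_ptr, rows, vals = [0], [], []
    col = 0
    for (r, c), v in by_col:
        while col < c:          # close out empty/finished columns
            col_ptr.append(len(vals))
            col += 1
        rows.append(r)
        vals.append(v)
    while col < n_cols:         # pad pointers for trailing empty columns
        col_ptr.append(len(vals))
        col += 1
    return col_ptr, rows, vals

s = add_blocks([{(0, 0): 1.0, (1, 1): 2.0}, {(0, 0): 3.0}])
print(to_csc(s, 2))  # ([0, 1, 2], [0, 1], [4.0, 2.0])
```

The conversion is done exactly once per result block, so no in-place CSC updates are needed.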

That means that with many blocks (the default block size is 1024) – in our
case 50M/1024 ~= 50K blocks per dimension – the simulate multiply will
effectively never finish, or will generate 50K*16GB ~= 1000TB of data. On the
other hand, with a bigger block size, e.g. 100K, we get an OutOfMemoryError
in the “toDense” call inside multiply. We have worked around this by
implementing both the sparse multiplication and the addition ourselves in a
very naïve way – but at least it works.
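A naive sparse-sparse block product in the spirit of that workaround might look as follows (plain Python sketch, not our actual MCL code): each block is a (row, col) -> value dict, and the cost is proportional only to the number of matching non-zero pairs, with no densification.

```python
def sparse_multiply(a, b):
    """a, b: sparse blocks as (row, col) -> value dicts; returns a * b."""
    # index b's non-zeros by row so each non-zero of a meets only its partners
    b_rows = {}
    for (k, j), v in b.items():
        b_rows.setdefault(k, []).append((j, v))

    out = {}
    for (i, k), av in a.items():
        for j, bv in b_rows.get(k, []):
            out[(i, j)] = out.get((i, j), 0.0) + av * bv
    return out

a = {(0, 0): 2.0, (0, 1): 3.0}
b = {(0, 0): 4.0, (1, 0): 5.0}
print(sparse_multiply(a, b))  # {(0, 0): 23.0}
```

Even this naive scheme avoids both the toDense conversion and the per-entry CSC updates described above.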




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
