Peng Meng created SPARK-21680:
---------------------------------

             Summary: ML/MLLIB Vector compressed optimization
                 Key: SPARK-21680
                 URL: https://issues.apache.org/jira/browse/SPARK-21680
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.3.0
            Reporter: Peng Meng


Using Vector.compressed to convert a Vector to a SparseVector is much slower
than calling Vector.toSparse directly.
This is because Vector.compressed scans the values three times, whereas
Vector.toSparse needs only two scans: compressed calls numNonzeros and then
toSparse, which counts the nonzeros again before copying them.
When the vector is long, there is a significant performance difference
between the two methods.
Current code of Vector.compressed:
{code:java}
  def compressed: Vector = {
    val nnz = numNonzeros
    // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
    if (1.5 * (nnz + 1.0) < size) {
      toSparse
    } else {
      toDense
    }
  }
{code}
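
The threshold in the condition follows from the two byte counts given in the comment. A short derivation, assuming those storage costs are exactly as the comment states:

```latex
\begin{align*}
12\,\mathrm{nnz} + 20 &< 8\,\mathrm{size} + 8 \\
12\,\mathrm{nnz} + 12 &< 8\,\mathrm{size} \\
1.5\,(\mathrm{nnz} + 1) &< \mathrm{size}
\end{align*}
```

For example, with size = 8 and nnz = 2 the sparse form needs 44 bytes against 72 for the dense form, and 1.5 * 3 = 4.5 < 8, so the sparse branch is taken.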

I propose to change it to:


{code:java}
  def compressed: Vector = {
    val nnz = numNonzeros
    // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
    if (1.5 * (nnz + 1.0) < size) {
      // Build the sparse vector here, reusing nnz, instead of calling toSparse.
      val ii = new Array[Int](nnz)
      val vv = new Array[Double](nnz)
      var k = 0
      foreachActive { (i, v) =>
        if (v != 0) {
          ii(k) = i
          vv(k) = v
          k += 1
        }
      }
      new SparseVector(size, ii, vv)
    } else {
      toDense
    }
  }
{code}
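
The difference in the number of scans can be illustrated with a standalone sketch that has no Spark dependency. Everything in it (the CompressedSketch object, the scans counter, the fill helper, and the Option return type) is hypothetical and only mimics the count-then-fill structure described above; it is not the Spark implementation.

```scala
// Standalone illustration only; all names here are hypothetical, not Spark APIs.
object CompressedSketch {
  // Number of full passes made over the values array.
  var scans = 0

  // One pass: count the nonzero entries.
  def numNonzeros(values: Array[Double]): Int = {
    scans += 1
    var nnz = 0
    var i = 0
    while (i < values.length) {
      if (values(i) != 0.0) nnz += 1
      i += 1
    }
    nnz
  }

  // One pass: copy nonzero entries into index/value arrays of known length.
  def fill(values: Array[Double], nnz: Int): (Array[Int], Array[Double]) = {
    scans += 1
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    var i = 0
    while (i < values.length) {
      val v = values(i)
      if (v != 0.0) {
        ii(k) = i
        vv(k) = v
        k += 1
      }
      i += 1
    }
    (ii, vv)
  }

  // Count, then fill: two passes.
  def toSparse(values: Array[Double]): (Array[Int], Array[Double]) =
    fill(values, numNonzeros(values))

  // Current shape: count, then toSparse counts again and fills (three passes).
  // Returns None where the real method would return a dense vector.
  def compressedCurrent(values: Array[Double]): Option[(Array[Int], Array[Double])] = {
    val nnz = numNonzeros(values)
    if (1.5 * (nnz + 1.0) < values.length) Some(toSparse(values)) else None
  }

  // Proposed shape: count once and reuse nnz for the fill (two passes).
  def compressedProposed(values: Array[Double]): Option[(Array[Int], Array[Double])] = {
    val nnz = numNonzeros(values)
    if (1.5 * (nnz + 1.0) < values.length) Some(fill(values, nnz)) else None
  }
}
```

Both variants produce the same indices and values; only the number of passes over the data differs, which is why the gap should grow with the vector length.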




