[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-10 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121175#comment-16121175
 ] 

Peng Meng commented on SPARK-21680:
---

I mean that if the user calls toSparse(size) and the size passed in is smaller than 
numNonzeros, there may be a problem. 
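
A minimal sketch of the hazard being described, assuming a hypothetical count-taking overload (called toSparse(size) above and toSparse(nnz) elsewhere in this thread; it is not an existing API). The fragment uses nnz for the caller-supplied count to avoid confusion with the vector's own size:

{code:java}
// Arrays are sized by the caller-supplied count, but the copy loop is driven by the
// actual non-zero entries, so an undersized count overruns the arrays.
val ii = new Array[Int](nnz)
val vv = new Array[Double](nnz)
var k = 0
foreachActive { (i, v) =>
  if (v != 0) {
    ii(k) = i          // throws ArrayIndexOutOfBoundsException once k reaches nnz
    vv(k) = v
    k += 1
  }
}
{code}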

> ML/MLLIB Vector compressed optimization
> ---
>
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> When Vector.compressed is used to convert a Vector to a SparseVector, performance 
> is much lower than with Vector.toSparse.
> This is because Vector.compressed scans the values three times, while 
> Vector.toSparse only scans them twice.
> When the vector is long, there is a significant performance difference between the 
> two methods.
> Code of Vector compressed:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       toSparse
>     } else {
>       toDense
>     }
>   }
> {code}
> I propose to change it to:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       // Build the sparse arrays directly, reusing nnz, so the values are scanned only once more.
>       val ii = new Array[Int](nnz)
>       val vv = new Array[Double](nnz)
>       var k = 0
>       foreachActive { (i, v) =>
>         if (v != 0) {
>           ii(k) = i
>           vv(k) = v
>           k += 1
>         }
>       }
>       new SparseVector(size, ii, vv)
>     } else {
>       toDense
>     }
>   }
> {code}
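
For context, toSparse itself is roughly the following (a paraphrase of the DenseVector.toSparse implementation of this era, whose body the proposed compressed inlines so that the already-computed nnz can be reused). numNonzeros is one pass over the values and foreachActive is another, so compressed, which calls numNonzeros before delegating to toSparse, touches the values three times in total:

{code:java}
// Paraphrase of toSparse: one pass to count non-zeros, one pass to copy them.
def toSparse: SparseVector = {
  val nnz = numNonzeros            // pass 1 (pass 2 overall when called from compressed)
  val ii = new Array[Int](nnz)
  val vv = new Array[Double](nnz)
  var k = 0
  foreachActive { (i, v) =>        // pass 2 (pass 3 overall when called from compressed)
    if (v != 0) {
      ii(k) = i
      vv(k) = v
      k += 1
    }
  }
  new SparseVector(size, ii, vv)
}
{code}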






[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121166#comment-16121166
 ] 

Sean Owen commented on SPARK-21680:
---

I don't get what security issue you mean here, but no, the change you proposed 
initially is not a good solution. 




[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120947#comment-16120947
 ] 

Peng Meng commented on SPARK-21680:
---

Hi [~srowen], if we add toSparse(size), then for security reasons it is better to 
check size against numNonzeros: if size is smaller than numNonzeros, the program may 
crash. But if we add that check against numNonzeros, we introduce one more scan over 
the values.

So in this PR, I revised the code as described in this JIRA.

Thanks. 
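
A minimal sketch of the trade-off described above, assuming a hypothetical toSparse(nnz) overload (not an existing API): validating the caller-supplied count means recomputing numNonzeros, which is itself a full pass over the values.

{code:java}
// Hypothetical guard at the top of a toSparse(nnz) overload. numNonzeros iterates
// over every value, so this check re-introduces the extra scan that passing nnz
// in was meant to avoid.
require(nnz == numNonzeros, s"nnz ($nnz) does not match the actual non-zero count")
{code}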




[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120136#comment-16120136
 ] 

Peng Meng commented on SPARK-21680:
---

Ok, thanks, I will submit a PR.




[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120128#comment-16120128
 ] 

Sean Owen commented on SPARK-21680:
---

Yes, the latter should be private and the former calls it too, I suppose. 
Something like that.
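
One way to read this suggestion (a sketch only; the method name, visibility, and final signature here are assumptions, not the API that was eventually committed): the public toSparse keeps its behaviour, a private overload trusts a pre-computed non-zero count, and compressed passes along the count it already has.

{code:java}
// Public method: unchanged behaviour, counts and then delegates.
def toSparse: SparseVector = toSparse(numNonzeros)

// Private overload: trusts the pre-computed count, so it makes only one pass.
private[linalg] def toSparse(nnz: Int): SparseVector = {
  val ii = new Array[Int](nnz)
  val vv = new Array[Double](nnz)
  var k = 0
  foreachActive { (i, v) =>
    if (v != 0) {
      ii(k) = i
      vv(k) = v
      k += 1
    }
  }
  new SparseVector(size, ii, vv)
}

def compressed: Vector = {
  val nnz = numNonzeros
  // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
  if (1.5 * (nnz + 1.0) < size) {
    toSparse(nnz)   // nnz is already known, so the values are scanned only twice in total
  } else {
    toDense
  }
}
{code}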




[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120115#comment-16120115
 ] 

Peng Meng commented on SPARK-21680:
---

Then we will have two versions of toSparse:
toSparse
and
toSparse(size)
Is that what you mean? Thanks.




[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120109#comment-16120109
 ] 

Sean Owen commented on SPARK-21680:
---

You definitely want to avoid duplicating the code, but you could change toSparse to 
accept nnz if it's already known. 
