[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-03-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20472


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-28 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171184500
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

have fixed it.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-28 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171182634
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
 val numFeatures = metadata.numFeatures
 val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
   case i if metadata.isContinuous(i) =>
-val split = continuousSplits(i)
+// some features may only contains zero, so continuousSplits will not have a record
+val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split]
--- End diff --

have fixed it


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-28 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171182291
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
 val numFeatures = metadata.numFeatures
 val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
   case i if metadata.isContinuous(i) =>
-val split = continuousSplits(i)
+// some features may only contains zero, so continuousSplits will not have a record
--- End diff --

got it


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-27 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r171160692
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
--- End diff --

I have read the note on the _sample_ function, and _sample_ does not guarantee to return exactly the requested fraction of the RDD's count. It seems that requiring _numSamples - partNumSamples_ to be non-negative is a more efficient choice than triggering a _count_. The degree of approximation depends on how approximate _sample_ itself is, and the splits are certain to be somewhat inaccurate either way.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-21 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r169844579
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
--- End diff --

Inaccurate zero counts can change the splits, can't they?


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-20 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r169391525
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
--- End diff --

The main problem I see with this is that the sampling we do for split 
finding is _approximate_. Just as an example: say you have 1000 samples, and 
you take 20% for split finding. Your actual sampled RDD has 220 samples in it, 
and 210 of those are non-zero. So, `partNumSamples = 210`, `numSamples = 200` 
and you wind up with `numSamples - partNumSamples = -10` zero values. This is 
not something you expect to happen often (since we care about the highly sparse 
case), but something that we need to consider. We could just require the 
subtraction to be non-negative (and live with a bit of approximation), or you 
could call `count` on the sampled RDD but I don't think it's worth it. Thoughts?
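
A minimal sketch of the non-negative clamp option, assuming the names from the diff above (`partValueCountMap`, `partNumSamples`, `numSamples`):

```
// Infer the zero count from the expected sample size, clamping at zero so
// sampling noise can never produce a negative count. When the clamp fires,
// the zero count is slightly underestimated, which only perturbs the
// already-approximate split points.
val zeroCount = math.max(0, numSamples - partNumSamples)

// Add an explicit 0.0 entry only when zeros were actually inferred, so a
// (0.0 -> 0) entry cannot create a spurious candidate split.
val valueCountMap: Map[Double, Int] =
  if (zeroCount > 0) partValueCountMap + (0.0 -> zeroCount) else partValueCountMap
```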


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-20 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r169386551
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
+
+  // add zero value count and get complete statistics
+  val valueCountMap: Map[Double, Int] = partValueCountMap + (0.0 -> (numSamples - partNumSamples))
--- End diff --

There can be negative values here, right?


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r167393891
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
 val numFeatures = metadata.numFeatures
 val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
   case i if metadata.isContinuous(i) =>
-val split = continuousSplits(i)
+// some features may only contains zero, so continuousSplits will not have a record
--- End diff --

nit: `only contains zero` -> `contain only zero`?


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166771387
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
+
+  // add zero value count and get complete statistics
+  val valueCountMap: Map[Double, Int] = partValueCountMap + (0.0 -> (numSamples - partNumSamples))
--- End diff --

This also probably doesn't matter, but won't the new `(0.0, ...)` element always come first? You could append it below after sorting the rest, rather than adding it to the Map earlier.
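
A sketch of that alternative, assuming `partValueCountMap` from the diff and a `zeroCount` value computed beforehand; since continuous feature values can also be negative, the zero entry is spliced in at its ordered position rather than simply prepended:

```
// Sort only the non-zero counts, then splice the zero entry into place
// instead of adding it to the Map and paying for it in the sort.
val sortedNonZero = partValueCountMap.toSeq.sortBy(_._1)
val (negatives, positives) = sortedNonZero.span(_._1 < 0.0)
val valueCounts = (negatives :+ (0.0 -> zeroCount)) ++ positives
```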


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166770380
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
 val numFeatures = metadata.numFeatures
 val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
   case i if metadata.isContinuous(i) =>
-val split = continuousSplits(i)
+// some features may only contains zero, so continuousSplits will not have a record
+val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split]
--- End diff --

Just `continuousSplits.getOrElse(i, Array.empty[Split])`? Not that it 
matters, but also avoids searching the map twice. 


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166771033
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
--- End diff --

Just a style nit but you can drop the types on these two new vars. I think 
we'd consider that implicit.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166773606
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

This bit of (existing) code seems obtuse to me. What about...

```
val partNumSamples = featureSamples.size
val partValueCountMap = scala.collection.mutable.Map[Double, Int]()
featureSamples.foreach { x =>
  partValueCountMap(x) = partValueCountMap.getOrElse(x, 0) + 1
}
```


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-06 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166501185
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong
--- End diff --

Thanks for your comments, and I'm sorry for my carelessness. I have fixed them all.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-06 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166249547
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1147,4 +1150,20 @@ private[spark] object RandomForest extends Logging {
   3 * totalBins
 }
   }
+
+  /**
+* Calculate the subsample fraction for finding splits
--- End diff --

bad indentation, this should fail scalastyle


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-06 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r166250001
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong
--- End diff --

I think here it could be:
```
val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
```


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-04 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165872418
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 1)
--- End diff --

I defined a private method _samplesFractionForFindSplits_ in 
RandomForest.scala


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-02 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165671106
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging {
 } else {
   val numSplits = metadata.numSplits(featureIndex)
 
-  // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  // get count for each distinct value except zero value
+  val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
 case ((m, cnt), x) =>
   (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
   }
+
+  // Calculate the number of samples for finding splits
+  var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 1)
--- End diff --

This logic is somewhat copied from another method. Can we create a new method, or pass the result through the methods?
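
A minimal sketch of what the _samplesFractionForFindSplits_ helper mentioned above could look like, based on the `requiredSamples` logic in the diff (the floor value of 1 is taken from that diff; the exact signature is an assumption):

```
/**
 * Fraction of the input to sample when finding splits for continuous
 * features, chosen so that roughly maxBins^2 samples are used.
 */
private def samplesFractionForFindSplits(metadata: DecisionTreeMetadata): Double = {
  // Number of samples needed for approximate quantile computation.
  val requiredSamples = math.max(metadata.maxBins * metadata.maxBins, 1)
  if (requiredSamples < metadata.numExamples) {
    requiredSamples.toDouble / metadata.numExamples
  } else {
    1.0
  }
}
```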


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-02 Thread lucio-yz
Github user lucio-yz commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165595281
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging {
   val numSplits = metadata.numSplits(featureIndex)
 
   // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

The problem has been solved.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165378186
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -931,7 +934,8 @@ private[spark] object RandomForest extends Logging {
 val numFeatures = metadata.numFeatures
 val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
   case i if metadata.isContinuous(i) =>
-val split = continuousSplits(i)
+// some features may only contains zero, so continuousSplits will not have a record
+val split = if ( continuousSplits.contains(i) ) continuousSplits(i) else Array.empty[Split]
--- End diff --

Remove space around parens


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165378109
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging {
   val numSplits = metadata.numSplits(featureIndex)
 
   // get count for each distinct value
-  val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+  var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
--- End diff --

I'd probably avoid a `var` by using a new variable for `numSamples` here.


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165340988
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -21,7 +21,6 @@ import java.io.IOException
 
 import scala.collection.mutable
 import scala.util.Random
-
--- End diff --

this empty line must be kept


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165341639
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
   // being spun up that will definitely do no work.
   val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
 
+  val numInput = input.count()
+  val bcNumInput = input.sparkContext.broadcast(numInput)
+
   input
 .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
--- End diff --

Instead of adding the filter method there, you can avoid generating the record here in the first place when `point.features(idx)` is 0.0.
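
A sketch of that suggestion, assuming the `input` and `continuousFeatures` names from the diff; zero values are skipped inside the `flatMap`, so the records are never created at all:

```
input.flatMap { point =>
  // Emit (featureIndex, value) pairs only for non-zero values, so zeros
  // never enter the shuffle.
  continuousFeatures.collect {
    case idx if point.features(idx) != 0.0 => (idx, point.features(idx))
  }
}
```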


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165341133
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
   // being spun up that will definitely do no work.
   val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
 
+  val numInput = input.count()
+  val bcNumInput = input.sparkContext.broadcast(numInput)
--- End diff --

this is not needed, you can use `numInput` directly


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20472#discussion_r165343004
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
   // being spun up that will definitely do no work.
   val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
 
+  val numInput = input.count()
--- End diff --

we can compute this as `metadata.numExamples * fraction` in the calling method, to avoid launching another Spark job just to perform the count
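
A sketch of that computation, assuming the _samplesFractionForFindSplits_ helper discussed elsewhere in this thread; no extra job is needed because `numExamples` is already recorded in the metadata:

```
// Expected sample count, derived from metadata instead of input.count().
val numInput = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
```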


---




[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

2018-02-01 Thread lucio-yz
GitHub user lucio-yz opened a pull request:

https://github.com/apache/spark/pull/20472

[SPARK-22751][ML]Improve ML RandomForest shuffle performance

## What changes were proposed in this pull request?

As I mentioned in 
[SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest),
 there is a shuffle performance problem in ML RandomForest when training a random forest on high-dimensional data.

The reason is that, in org.apache.spark.ml.tree.impl.RandomForest, the function findSplitsBySorting effectively flatMaps each sparse vector into dense (featureIndex, value) records, so the groupByKey that follows has a huge shuffle write size.

To avoid this, we can add a filter after the flatMap to filter out zero values. Then, in the function findSplitsForContinuousFeature, we can infer the number of zero values by passing in a parameter numInput, the number of samples (sketched below).
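
A condensed sketch of the idea, using the names from this description (`numInput` here is assumed to be the sampled record count; later revisions in the review derive it from the metadata instead):

```
// 1. Drop zero values before the shuffle, so only non-zero
//    (featureIndex, value) pairs are written.
val nonZeroPairs = input
  .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
  .filter(_._2 != 0.0)

// 2. Inside findSplitsForContinuousFeature: count the values that survived
//    for this feature, then infer how many zeros were filtered out.
val partNumSamples = featureSamples.size
val zeroCount = numInput - partNumSamples  // zeros never reached the shuffle
```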

In addition, if a feature contains only zero values, continuousSplits will not have a key for that feature id, so I added a check when using continuousSplits.

## How was this patch tested?
Ran the model locally using spark-submit.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lucio-yz/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20472


commit 50cb173dd34dc353c243b97f2686a8c545a03909
Author: lucio <576632108@...>
Date:   2018-02-01T09:47:52Z

fix mllib randomforest shuffle issue




---
