[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...

2018-09-07 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21942#discussion_r216021077
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,88 @@ class StandardScalerModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val scaler = new feature.StandardScalerModel(std, mean, $(withStd), 
$(withMean))
-
-// TODO: Make the transformer natively in ml framework to avoid extra 
conversion.
-val transformer: Vector => Vector = v => 
scaler.transform(OldVectors.fromML(v)).asML
+val transformer: Vector => Vector = v => transform(v)
 
 val scale = udf(transformer)
 dataset.withColumn($(outputCol), scale(col($(inputCol
   }
 
+  /**
+   * Since `shift` will only be used in the `withMean` branch, we make it a
+   * `lazy val` so it is evaluated only in that branch. Note that we don't
+   * want to create this array multiple times in the `transform` function.
+   */
+  private lazy val shift: Array[Double] = mean.toArray
--- End diff --

How does this interplay with serialization? Would it make sense to evaluate
this before we serialize the UDF so it isn't re-done on each worker?
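
For illustration, here is a toy, non-Spark sketch of the serialization behavior in
question, assuming plain Java serialization of a Scala 2.x lazy val (Spark's closure
serializer adds its own wrinkles). The class and names are hypothetical, not the PR's code:

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

    // Toy stand-in for the model: `shift` is a lazy val derived from `mean`.
    class ToyScaler(val mean: Array[Double]) extends Serializable {
      lazy val shift: Array[Double] = {
        println("evaluating shift") // side effect to show where evaluation happens
        mean.clone()
      }
    }

    object LazyValSerDemo {
      def roundTrip[T](obj: T): T = {
        val bos = new ByteArrayOutputStream()
        val oos = new ObjectOutputStream(bos)
        oos.writeObject(obj)
        oos.close()
        new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
          .readObject().asInstanceOf[T]
      }

      def main(args: Array[String]): Unit = {
        val eager = new ToyScaler(Array(1.0, 2.0))
        eager.shift               // forced on the "driver" before serialization
        roundTrip(eager).shift    // prints nothing: the computed value ships with the object

        val deferred = new ToyScaler(Array(1.0, 2.0))
        roundTrip(deferred).shift // prints "evaluating shift": recomputed after deserialization
      }
    }

So touching `shift` on the driver (for example, in `transform(dataset)` when `$(withMean)`
is set) would materialize the array once rather than once per deserialized copy.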


---




[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...

2018-09-07 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21942#discussion_r216022250
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,88 @@ class StandardScalerModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val scaler = new feature.StandardScalerModel(std, mean, $(withStd), 
$(withMean))
-
-// TODO: Make the transformer natively in ml framework to avoid extra 
conversion.
-val transformer: Vector => Vector = v => 
scaler.transform(OldVectors.fromML(v)).asML
+val transformer: Vector => Vector = v => transform(v)
 
 val scale = udf(transformer)
 dataset.withColumn($(outputCol), scale(col($(inputCol
   }
 
+  /**
+   * Since `shift` will only be used in the `withMean` branch, we make it a
+   * `lazy val` so it is evaluated only in that branch. Note that we don't
+   * want to create this array multiple times in the `transform` function.
+   */
+  private lazy val shift: Array[Double] = mean.toArray
+
+  /**
+   * Applies standardization transformation on a vector.
+   *
+   * @param vector Vector to be standardized.
+   * @return Standardized vector. If the std of a column is zero, `0.0` is
+   *         returned for that column.
+   */
+  private[spark] def transform(vector: Vector): Vector = {
+    require(mean.size == vector.size)
+    if ($(withMean)) {
+      /**
+       * By default, Scala generates Java methods for member variables. So every time
+       * member variables are accessed, `invokespecial` is called. This is an expensive
+       * operation, and can be avoided by having a local reference of `shift`.
+       */
+      val localShift = shift
+      /** Must have a copy of the values since they will be modified in place. */
+      val values = vector match {
+        /** Handle DenseVector specially because its `toArray` method does not clone values. */
+        case d: DenseVector => d.values.clone()
+        case v: Vector => v.toArray
+      }
+      val size = values.length
+      if ($(withStd)) {
+        var i = 0
+        while (i < size) {
+          values(i) = if (std(i) != 0.0) (values(i) - localShift(i)) * (1.0 / std(i)) else 0.0
+          i += 1
+        }
+      } else {
+        var i = 0
+        while (i < size) {
+          values(i) -= localShift(i)
+          i += 1
+        }
+      }
+      Vectors.dense(values)
+    } else if ($(withStd)) {
--- End diff --

Maybe leave a comment noting this branch handles withStd without withMean; when
tracing the code by hand, the nested if/else-if flow can get a bit confusing.
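
A hypothetical standalone version of the method with the kind of branch comments being
suggested (names and structure are illustrative, not the PR's code):

    import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}

    object StandardizeSketch {
      def standardize(
          vector: Vector,
          shift: Array[Double],
          std: Array[Double],
          withMean: Boolean,
          withStd: Boolean): Vector = {
        if (withMean) {
          // Branch 1: withMean (with or without withStd) -- center every value,
          // then optionally scale; needs a dense, mutable copy of the values.
          val values = vector match {
            case d: DenseVector => d.values.clone() // toArray would alias the backing array
            case v => v.toArray
          }
          var i = 0
          while (i < values.length) {
            values(i) -= shift(i)
            if (withStd) values(i) = if (std(i) != 0.0) values(i) / std(i) else 0.0
            i += 1
          }
          Vectors.dense(values)
        } else if (withStd) {
          // Branch 2: withStd only, no centering -- scale each value, mapping
          // zero-std columns to 0.0.
          val values = vector.toArray
          var i = 0
          while (i < values.length) {
            values(i) = if (std(i) != 0.0) values(i) / std(i) else 0.0
            i += 1
          }
          Vectors.dense(values)
        } else {
          // Branch 3: neither flag set -- identity transform.
          vector
        }
      }
    }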


---




[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...

2018-08-01 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/21942#discussion_r207103066
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala ---
@@ -160,15 +160,89 @@ class StandardScalerModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val scaler = new feature.StandardScalerModel(std, mean, $(withStd), 
$(withMean))
-
-// TODO: Make the transformer natively in ml framework to avoid extra 
conversion.
-val transformer: Vector => Vector = v => 
scaler.transform(OldVectors.fromML(v)).asML
+val transformer: Vector => Vector = v => transform(v)
 
 val scale = udf(transformer)
 dataset.withColumn($(outputCol), scale(col($(inputCol
   }
 
+  /**
+   * Since `shift` will only be used in the `withMean` branch, we make it a
+   * `lazy val` so it is evaluated only in that branch. Note that we don't
+   * want to create this array multiple times in the `transform` function.
+   */
+  private lazy val shift: Array[Double] = mean.toArray
+
+  /**
+   * Applies standardization transformation on a vector.
+   *
+   * @param vector Vector to be standardized.
+   * @return Standardized vector. If the std of a column is zero, `0.0` is
+   *         returned for that column.
+   */
+  @Since("2.3.0")
+  def transform(vector: Vector): Vector = {
--- End diff --

private[spark]?
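
For reference, a minimal sketch of what the suggested qualifier does (hypothetical class,
not the PR's code): `private[spark]` makes a member visible anywhere under the
org.apache.spark package while keeping it out of the public API surface:

    package org.apache.spark.ml.feature

    class VisibilitySketch {
      private[spark] def internalTransform(): Unit = () // callable from org.apache.spark.* only
      def publicTransform(): Unit = ()                  // callable from anywhere
    }

The later revision of the diff (first excerpt above) shows the new method with this
modifier applied.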


---




[GitHub] spark pull request #21942: [SPARK-24283][ML] Make ml.StandardScaler skip con...

2018-08-01 Thread sujithjay
GitHub user sujithjay opened a pull request:

https://github.com/apache/spark/pull/21942

[SPARK-24283][ML] Make ml.StandardScaler skip conversion of Spark ml vectors to mllib vectors

## What changes were proposed in this pull request?
Currently, ml.StandardScaler converts Spark ml vectors to mllib vectors during
the transform operation. This PR skips that conversion by implementing the
transformation natively on ml vectors.
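
For context, the per-row round-trip being removed looks roughly like this (a
spark-shell-style sketch; the actual calls appear in the diff excerpts above):

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

    val v: Vector = Vectors.dense(1.0, 2.0, 3.0)
    val old = OldVectors.fromML(v) // ml -> mllib conversion, once per row inside the UDF
    val back = old.asML            // mllib -> ml conversion on the way back out

Skipping both conversions avoids extra object allocations per row.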

## How was this patch tested?
Existing tests in StandardScalerSuite


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujithjay/spark SPARK-24283

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21942.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21942


commit c010d1bb904717412591384f31bd085e3d98f502
Author: sujithjay 
Date:   2018-08-01T08:37:10Z

[SPARK-24283][ML][WIP] Make ml.StandardScaler skip conversion of Spark ml vectors to mllib vectors




---
