[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20686


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-14 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r174656617
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala ---
@@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De

 val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")

-def validateResults(df: DataFrame): Unit = {
-  df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) =>
+def validateResults(rows: Seq[Row]): Unit = {
+  rows.foreach { case Row(vec1: Vector, vec2: Vector) =>
 assert(vec1 === vec2)
   }
-  val resultMetadata = AttributeGroup.fromStructField(df.schema("result"))
-  val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected"))
+  val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
+  val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
   assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
   resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
 assert(a === b)
   }
 }

 vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty)
-validateResults(vectorSlicer.transform(df))
+testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")(
--- End diff --

I see, makes sense.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r174655509
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala ---
@@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De

 val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")

-def validateResults(df: DataFrame): Unit = {
-  df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) =>
+def validateResults(rows: Seq[Row]): Unit = {
+  rows.foreach { case Row(vec1: Vector, vec2: Vector) =>
 assert(vec1 === vec2)
   }
-  val resultMetadata = AttributeGroup.fromStructField(df.schema("result"))
-  val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected"))
+  val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
+  val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
   assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
   resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
 assert(a === b)
   }
 }

 vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty)
-validateResults(vectorSlicer.transform(df))
+testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")(
--- End diff --

ok.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-14 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r174535931
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -299,18 +310,17 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")

-val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
-  Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)),
-  Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)),
-  Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0)))
+val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
--- End diff --

I agree that's correct.  The problem is that people tend to see these 
patterns and copy them without thinking.  It's best to follow patterns which 
help other contributors to avoid making mistakes.
I'm OK with leaving it since this issue is scattered throughout MLlib tests.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-13 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r174247908
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala ---
@@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De

 val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")

-def validateResults(df: DataFrame): Unit = {
-  df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) =>
+def validateResults(rows: Seq[Row]): Unit = {
+  rows.foreach { case Row(vec1: Vector, vec2: Vector) =>
 assert(vec1 === vec2)
   }
-  val resultMetadata = AttributeGroup.fromStructField(df.schema("result"))
-  val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected"))
+  val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
+  val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
   assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
   resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
 assert(a === b)
   }
 }

 vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty)
-validateResults(vectorSlicer.transform(df))
+testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")(
--- End diff --

The reason I chose the global check function is the checks on the attributes:

```
  val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
  val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
  assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
  resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
    assert(a === b)
  }
```
This part is not row-based but rather result-set based.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-13 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r174245361
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala ---
@@ -58,14 +57,16 @@ class VectorAssemblerSuite
 assert(v2.isInstanceOf[DenseVector])
   }

-  test("VectorAssembler") {
+  ignore("VectorAssembler") {
--- End diff --

ok


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173999280
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -86,16 +94,19 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest {
 }
   }

-  test("label column already exists but is not numeric type") {
+  ignore("label column already exists but is not numeric type") {
--- End diff --

Thanks, this is a very good catch. 


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173999125
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala ---
@@ -58,14 +57,16 @@ class VectorAssemblerSuite
 assert(v2.isInstanceOf[DenseVector])
   }

-  test("VectorAssembler") {
+  ignore("VectorAssembler") {
--- End diff --

@attilapiros You need to revert the code here and keep the old `VectorAssembler` test suite. `VectorAssembler` does not support streaming mode unless you pipeline a `VectorSizeHint` before it.
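
For illustration, a minimal sketch of that pipeline shape, assuming the Spark 2.3 `VectorSizeHint` API; the column names and the size value here are hypothetical:

```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Declare the length of the (hypothetical) vector column "features" up front,
// so VectorAssembler can derive its output schema without scanning the data,
// which is what a streaming DataFrame requires.
val sizeHint = new VectorSizeHint()
  .setInputCol("features")
  .setSize(3)
  .setHandleInvalid("skip")  // drop rows whose actual vector size differs

val assembler = new VectorAssembler()
  .setInputCols(Array("features", "weight"))
  .setOutputCol("assembled")

val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
```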


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173998034
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -299,18 +310,17 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")

-val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
-  Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)),
-  Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)),
-  Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0)))
+val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
--- End diff --

I confirmed this with @cloud-fan.
If you use the pattern:
```
Seq(...).toDF().select(...).collect()
```
it will use a `localRelation` and will always use one partition to do the computation, and the output row order will stay the same as the input Seq.

It also seems that many other test cases use a similar approach.
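
A tiny illustration of that claim (a sketch, assuming `spark.implicits._` is in scope):

```
// A local relation is evaluated in a single partition, so the collected
// row order matches the input Seq exactly.
val input = Seq(1, 2, 3, 4)
val collected = input.toDF("x").select("x").collect().map(_.getInt(0)).toSeq
assert(collected == input)
```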


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173995919
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -299,18 +310,17 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")

-val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
-  Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)),
-  Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)),
-  Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0)))
+val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
--- End diff --

I'd make a small argument that, without shuffling, a DataFrame transformation will keep the row ordering. :)


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173594072
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -299,18 +310,17 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")

-val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
-  Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)),
-  Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)),
-  Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0)))
+val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)),
--- End diff --

I'd keep using Set.  The point is to be row-order-agnostic since DataFrames 
do not guarantee stable row ordering in general.
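
For reference, a sketch of the order-agnostic comparison inside a global check function; `rows` is assumed to be the `Seq[Row]` handed to `testTransformerByGlobalCheckFunc`, and the expected pairs are the ones from the first Seq in the diff above:

```
// Compare as Sets so the assertion does not depend on row order.
val actual = rows.map(r => (r.getInt(0), r.getDouble(1))).toSet
val expected = Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0))
assert(actual === expected)
```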


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173592896
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -109,16 +111,14 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")
   .fit(df)
-val transformed = indexer.transform(df)
-val attr = Attribute.fromStructField(transformed.schema("labelIndex"))
-  .asInstanceOf[NominalAttribute]
-assert(attr.values.get === Array("100", "300", "200"))
-val output = transformed.select("id", "labelIndex").rdd.map { r =>
-  (r.getInt(0), r.getDouble(1))
-}.collect().toSet
-// 100 -> 0, 200 -> 2, 300 -> 1
-val expected = Set((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0))
-assert(output === expected)
+testTransformerByGlobalCheckFunc[(Int, String)](df, indexer, "id", "labelIndex") { rows =>
+  val attr = Attribute.fromStructField(rows.head.schema("labelIndex"))
+.asInstanceOf[NominalAttribute]
+  assert(attr.values.get === Array("100", "300", "200"))
+  // 100 -> 0, 200 -> 2, 300 -> 1
+  val expected = Seq((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)).toDF()
--- End diff --

ditto


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173600685
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
@@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext

 MLTestingUtils.checkCopyAndUids(vectorIndexer, model)

-model.transform(densePoints1) // should work
-model.transform(sparsePoints1) // should work
+// should work
+testTransformer[FeatureData](densePoints1, model, "indexed") { _ => }
+// should work
+testTransformer[FeatureData](sparsePoints1, model, "indexed") { _ => }
+
 // If the data is local Dataset, it throws AssertionError directly.
-intercept[AssertionError] {
-  model.transform(densePoints2).collect()
-  logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
+withClue("Did not found expected error message when fit, " +
+  "transform were called on vectors of different lengths") {
+  testTransformerByInterceptingException[FeatureData](
+densePoints2,
+model,
+"VectorIndexerModel expected vector of length 3 but found length 4",
+"indexed")
 }
 // If the data is distributed Dataset, it throws SparkException
 // which is the wrapper of AssertionError.
-intercept[SparkException] {
-  model.transform(densePoints2.repartition(2)).collect()
-  logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
+withClue("Did not found expected error message when fit, " +
--- End diff --

ditto


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173600416
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala ---
@@ -58,14 +57,16 @@ class VectorAssemblerSuite
 assert(v2.isInstanceOf[DenseVector])
   }

-  test("VectorAssembler") {
+  ignore("VectorAssembler") {
--- End diff --

Don't ignore; just test on batch.  This case is solved with VectorSizeHint.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173593795
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -247,14 +253,18 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")
   .fit(df)
-val transformed = indexer.transform(df)
+val expected1 = Seq(0.0, 2.0, 1.0, 0.0, 0.0, 1.0).map(Tuple1(_)).toDF("labelIndex")
--- End diff --

Not needed; this isn't what this unit test is testing.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173600331
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -86,16 +94,19 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest {
 }
   }

-  test("label column already exists but is not numeric type") {
+  ignore("label column already exists but is not numeric type") {
--- End diff --

testTransformerByInterceptingException should use ```(Int, Boolean)``` (not 
Double)


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173584784
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -324,19 +352,24 @@ class QuantileDiscretizerSuite
   .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3))
   .fit(df)

-val resultForMultiCols = plForMultiCols.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
+val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()

-val resultForSingleCol = plForSingleCol.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
+testTransformerByGlobalCheckFunc[(Double, Double, Double)](
+  df,
+  plForMultiCols,
+  "result1",
+  "result2",
+  "result3") { rows =>
+assert(rows === expected)
+  }

-resultForSingleCol.zip(resultForMultiCols).foreach {
-  case (rowForSingle, rowForMultiCols) =>
-assert(rowForSingle.getDouble(0) == rowForMultiCols.getDouble(0) &&
-  rowForSingle.getDouble(1) == rowForMultiCols.getDouble(1) &&
-  rowForSingle.getDouble(2) == rowForMultiCols.getDouble(2))
+testTransformerByGlobalCheckFunc[(Double, Double, Double)](
--- End diff --

I'd remove this.  Testing vs. multiCol is already testing batch vs 
streaming.  No need to test singleCol against itself.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173582122
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala ---
@@ -90,23 +96,29 @@ class OneHotEncoderSuite
 val encoder = new OneHotEncoder()
   .setInputCol("size")
   .setOutputCol("encoded")
-val output = encoder.transform(df)
-val group = AttributeGroup.fromStructField(output.schema("encoded"))
-assert(group.size === 2)
-assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0))
-assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+testTransformerByGlobalCheckFunc[(Double)](df, encoder, "encoded") { rows =>
+  val group = AttributeGroup.fromStructField(rows.head.schema("encoded"))
+  assert(group.size === 2)
+  assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0))
+  assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+}
   }

-  test("input column without ML attribute") {
+
+  ignore("input column without ML attribute") {
--- End diff --

Let's keep the test but limit it to batch. People should switch to 
OneHotEncoderEstimator anyways.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173593049
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -109,16 +111,14 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")
   .fit(df)
-val transformed = indexer.transform(df)
-val attr = Attribute.fromStructField(transformed.schema("labelIndex"))
-  .asInstanceOf[NominalAttribute]
-assert(attr.values.get === Array("100", "300", "200"))
-val output = transformed.select("id", "labelIndex").rdd.map { r =>
-  (r.getInt(0), r.getDouble(1))
-}.collect().toSet
-// 100 -> 0, 200 -> 2, 300 -> 1
-val expected = Set((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0))
-assert(output === expected)
+testTransformerByGlobalCheckFunc[(Int, String)](df, indexer, "id", "labelIndex") { rows =>
+  val attr = Attribute.fromStructField(rows.head.schema("labelIndex"))
+.asInstanceOf[NominalAttribute]
+  assert(attr.values.get === Array("100", "300", "200"))
+  // 100 -> 0, 200 -> 2, 300 -> 1
+  val expected = Seq((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)).toDF()
--- End diff --

And elsewhere below


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173600642
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
@@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext

 MLTestingUtils.checkCopyAndUids(vectorIndexer, model)

-model.transform(densePoints1) // should work
-model.transform(sparsePoints1) // should work
+// should work
+testTransformer[FeatureData](densePoints1, model, "indexed") { _ => }
+// should work
+testTransformer[FeatureData](sparsePoints1, model, "indexed") { _ => }
+
 // If the data is local Dataset, it throws AssertionError directly.
-intercept[AssertionError] {
-  model.transform(densePoints2).collect()
-  logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
+withClue("Did not found expected error message when fit, " +
--- End diff --

Or just use the original text


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173587524
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest {
   Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5))
 .toDF("id", "a", "b")
 val model = formula.fit(original)
+val attr = NominalAttribute.defaultAttr
 val expected = Seq(
 ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
 ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0),
 ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0),
 ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
 ).toDF("id", "a", "b", "features", "label")
-// assert(result.schema.toString == resultSchema.toString)
+  .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
--- End diff --

I'd just keep what you have now.  There isn't a great solution here, and 
what you have fits other code examples in MLlib.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173592635
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -70,36 +71,37 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")
   .fit(df)
+
 // Verify we throw by default with unseen values
-intercept[SparkException] {
-  indexer.transform(df2).collect()
-}
+testTransformerByInterceptingException[(Int, String)](
+  df2,
+  indexer,
+  "Unseen label:",
+  "labelIndex")

 indexer.setHandleInvalid("skip")
-// Verify that we skip the c record
-val transformedSkip = indexer.transform(df2)
-val attrSkip = Attribute.fromStructField(transformedSkip.schema("labelIndex"))
-  .asInstanceOf[NominalAttribute]
-assert(attrSkip.values.get === Array("b", "a"))
-val outputSkip = transformedSkip.select("id", "labelIndex").rdd.map { r =>
-  (r.getInt(0), r.getDouble(1))
-}.collect().toSet
-// a -> 1, b -> 0
-val expectedSkip = Set((0, 1.0), (1, 0.0))
-assert(outputSkip === expectedSkip)
+
+testTransformerByGlobalCheckFunc[(Int, String)](df2, indexer, "id", "labelIndex") { rows =>
+  val attrSkip = Attribute.fromStructField(rows.head.schema("labelIndex"))
+.asInstanceOf[NominalAttribute]
+  assert(attrSkip.values.get === Array("b", "a"))
+  // Verify that we skip the c record
+  // a -> 1, b -> 0
+  val expectedSkip = Seq((0, 1.0), (1, 0.0)).toDF()
--- End diff --

This can be moved outside of the testTransformerByGlobalCheckFunc method.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173594378
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -328,7 +338,12 @@ class StringIndexerSuite
   .setOutputCol("CITYIndexed")
   .fit(dfNoBristol)

-val dfWithIndex = model.transform(dfNoBristol)
-assert(dfWithIndex.filter($"CITYIndexed" === 1.0).count == 1)
+testTransformerByGlobalCheckFunc[(String, String, String)](
+  dfNoBristol,
+  model,
+  "CITYIndexed") { rows =>
+  val transformed = rows.map { r => r.getDouble(0) }.toDF("CITYIndexed")
--- End diff --

It's probably easier to avoid going through a DataFrame here.
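
For example, the count can be taken directly on the collected rows (a sketch of the suggested simplification):

```
// No need to rebuild a DataFrame just to filter and count.
assert(rows.count(_.getDouble(0) == 1.0) === 1)
```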


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173600463
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala ---
@@ -76,16 +77,18 @@ class VectorAssemblerSuite
 val assembler = new VectorAssembler()
   .setInputCols(Array("a", "b", "c"))
   .setOutputCol("features")
-val thrown = intercept[IllegalArgumentException] {
-  assembler.transform(df)
-}
-assert(thrown.getMessage contains
+testTransformerByInterceptingException[(String, String, String)](
+  df,
+  assembler,
   "Data type StringType of column a is not supported.\n" +
   "Data type StringType of column b is not supported.\n" +
-  "Data type StringType of column c is not supported.")
+  "Data type StringType of column c is not supported.",
+  "features")
   }

-  test("ML attributes") {
+  ignore("ML attributes") {
--- End diff --

ditto: do not ignore


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173876598
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala ---
@@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De

 val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")

-def validateResults(df: DataFrame): Unit = {
-  df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) =>
+def validateResults(rows: Seq[Row]): Unit = {
+  rows.foreach { case Row(vec1: Vector, vec2: Vector) =>
 assert(vec1 === vec2)
   }
-  val resultMetadata = AttributeGroup.fromStructField(df.schema("result"))
-  val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected"))
+  val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result"))
+  val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected"))
   assert(resultMetadata.numAttributes === expectedMetadata.numAttributes)
   resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) =>
 assert(a === b)
   }
 }

 vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty)
-validateResults(vectorSlicer.transform(df))
+testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")(
--- End diff --

Avoid using a global check function when you don't need to.  It'd be better 
to use testTransformer() since the test is per-row.
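
For comparison, a sketch of the per-row variant being suggested, assuming `MLTest.testTransformer` mirrors `testTransformerByGlobalCheckFunc` but invokes its check function once per output `Row`:

```
// Each output row is checked individually. Note that the metadata checks in
// validateResults need the whole result set, so they cannot move here.
testTransformer[(Vector, Vector)](df, vectorSlicer, "result", "expected") {
  case Row(vec1: Vector, vec2: Vector) =>
    assert(vec1 === vec2)
}
```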


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173593885
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -247,14 +253,18 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")
   .fit(df)
-val transformed = indexer.transform(df)
+val expected1 = Seq(0.0, 2.0, 1.0, 0.0, 0.0, 1.0).map(Tuple1(_)).toDF("labelIndex")
+testTransformerByGlobalCheckFunc[(Int, String)](df, indexer, "labelIndex") { rows =>
+  assert(rows == expected1.collect().seq)
+}
+
 val idx2str = new IndexToString()
   .setInputCol("labelIndex")
   .setOutputCol("sameLabel")
   .setLabels(indexer.labels)
-idx2str.transform(transformed).select("label", "sameLabel").collect().foreach {
-  case Row(a: String, b: String) =>
-assert(a === b)
+
+testTransformerByGlobalCheckFunc[(Double)](expected1, idx2str, "sameLabel") { rows =>
--- End diff --

You should be able to test per-row, rather than using a global check 
function.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173592811
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -70,36 +71,37 @@ class StringIndexerSuite
   .setInputCol("label")
   .setOutputCol("labelIndex")
   .fit(df)
+
 // Verify we throw by default with unseen values
-intercept[SparkException] {
-  indexer.transform(df2).collect()
-}
+testTransformerByInterceptingException[(Int, String)](
+  df2,
+  indexer,
+  "Unseen label:",
+  "labelIndex")

 indexer.setHandleInvalid("skip")
-// Verify that we skip the c record
-val transformedSkip = indexer.transform(df2)
-val attrSkip = Attribute.fromStructField(transformedSkip.schema("labelIndex"))
-  .asInstanceOf[NominalAttribute]
-assert(attrSkip.values.get === Array("b", "a"))
-val outputSkip = transformedSkip.select("id", "labelIndex").rdd.map { r =>
-  (r.getInt(0), r.getDouble(1))
-}.collect().toSet
-// a -> 1, b -> 0
-val expectedSkip = Set((0, 1.0), (1, 0.0))
-assert(outputSkip === expectedSkip)
+
+testTransformerByGlobalCheckFunc[(Int, String)](df2, indexer, "id", "labelIndex") { rows =>
+  val attrSkip = Attribute.fromStructField(rows.head.schema("labelIndex"))
+.asInstanceOf[NominalAttribute]
+  assert(attrSkip.values.get === Array("b", "a"))
+  // Verify that we skip the c record
+  // a -> 1, b -> 0
+  val expectedSkip = Seq((0, 1.0), (1, 0.0)).toDF()
+  assert(rows.seq === expectedSkip.collect().toSeq)
+}

 indexer.setHandleInvalid("keep")
+
 // Verify that we keep the unseen records
-val transformedKeep = indexer.transform(df2)
-val attrKeep = Attribute.fromStructField(transformedKeep.schema("labelIndex"))
-  .asInstanceOf[NominalAttribute]
-assert(attrKeep.values.get === Array("b", "a", "__unknown"))
-val outputKeep = transformedKeep.select("id", "labelIndex").rdd.map { r =>
-  (r.getInt(0), r.getDouble(1))
-}.collect().toSet
-// a -> 1, b -> 0, c -> 2, d -> 3
-val expectedKeep = Set((0, 1.0), (1, 0.0), (2, 2.0), (3, 2.0))
-assert(outputKeep === expectedKeep)
+testTransformerByGlobalCheckFunc[(Int, String)](df2, indexer, "id", "labelIndex") { rows =>
+  val attrKeep = Attribute.fromStructField(rows.head.schema("labelIndex"))
+.asInstanceOf[NominalAttribute]
+  assert(attrKeep.values.get === Array("b", "a", "__unknown"))
+  // a -> 1, b -> 0, c -> 2, d -> 3
+  val expectedKeep = Seq((0, 1.0), (1, 0.0), (2, 2.0), (3, 2.0)).toDF()
--- End diff --

ditto: move outside checkFunc


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173600557
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
@@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext

 MLTestingUtils.checkCopyAndUids(vectorIndexer, model)

-model.transform(densePoints1) // should work
-model.transform(sparsePoints1) // should work
+// should work
+testTransformer[FeatureData](densePoints1, model, "indexed") { _ => }
+// should work
+testTransformer[FeatureData](sparsePoints1, model, "indexed") { _ => }
+
 // If the data is local Dataset, it throws AssertionError directly.
-intercept[AssertionError] {
-  model.transform(densePoints2).collect()
-  logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
+withClue("Did not found expected error message when fit, " +
--- End diff --

found -> find


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173584864
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -364,18 +397,26 @@ class QuantileDiscretizerSuite
   .setOutputCols(Array("result1", "result2", "result3"))
   .setNumBucketsArray(Array(10, 10, 10))

-val result1 = discretizerSingleNumBuckets.fit(df).transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
-val result2 = discretizerNumBucketsArray.fit(df).transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
-
-result1.zip(result2).foreach {
-  case (row1, row2) =>
-assert(row1.getDouble(0) == row2.getDouble(0) &&
-  row1.getDouble(1) == row2.getDouble(1) &&
-  row1.getDouble(2) == row2.getDouble(2))
+val model = discretizerSingleNumBuckets.fit(df)
+val expected = model.transform(df).select("result1", "result2", "result3").collect()
+
+
+testTransformerByGlobalCheckFunc[(Double, Double, Double)](
+  df,
+  model,
+  "result1",
+  "result2",
+  "result3") { rows =>
+  assert(rows === expected)
+}
+
+testTransformerByGlobalCheckFunc[(Double, Double, Double)](
--- End diff --

Is this a repeat of the test just above?


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173600517
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
@@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext

 MLTestingUtils.checkCopyAndUids(vectorIndexer, model)

-model.transform(densePoints1) // should work
-model.transform(sparsePoints1) // should work
+// should work
--- End diff --

We can remove the "should work" comments :P


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173582018
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala ---
@@ -90,23 +96,29 @@ class OneHotEncoderSuite
 val encoder = new OneHotEncoder()
   .setInputCol("size")
   .setOutputCol("encoded")
-val output = encoder.transform(df)
-val group = AttributeGroup.fromStructField(output.schema("encoded"))
-assert(group.size === 2)
-assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0))
-assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+testTransformerByGlobalCheckFunc[(Double)](df, encoder, "encoded") { rows =>
+  val group = AttributeGroup.fromStructField(rows.head.schema("encoded"))
+  assert(group.size === 2)
+  assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0))
+  assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+}
   }

-  test("input column without ML attribute") {
+
+  ignore("input column without ML attribute") {
--- End diff --

Let's keep the test but limit it to batch.  People should switch to 
OneHotEncoderEstimator anyways.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173581428
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala ---
@@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite =>
   otherResultCols: _*)(globalCheckFunction)
 testTransformerOnDF(dataframe, transformer, firstResultCol,
   otherResultCols: _*)(globalCheckFunction)
+}
+
+  def testTransformerByInterceptingException[A : Encoder](
+dataframe: DataFrame,
+transformer: Transformer,
+expectedMessagePart : String,
+firstResultCol: String) {
+
+def hasExpectedMessage(exception: Throwable): Boolean =
--- End diff --

Yes, that was the reason.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173580629
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala ---
@@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite =>
   otherResultCols: _*)(globalCheckFunction)
 testTransformerOnDF(dataframe, transformer, firstResultCol,
   otherResultCols: _*)(globalCheckFunction)
+}
+
+  def testTransformerByInterceptingException[A : Encoder](
+dataframe: DataFrame,
+transformer: Transformer,
+expectedMessagePart : String,
+firstResultCol: String) {
+
+def hasExpectedMessage(exception: Throwable): Boolean =
--- End diff --

Just curious: Did you have to add the getCause case because of streaming 
throwing wrapped exceptions?


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173580016
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala ---
@@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite =>
   otherResultCols: _*)(globalCheckFunction)
 testTransformerOnDF(dataframe, transformer, firstResultCol,
   otherResultCols: _*)(globalCheckFunction)
+}
+
+  def testTransformerByInterceptingException[A : Encoder](
+dataframe: DataFrame,
+transformer: Transformer,
+expectedMessagePart : String,
+firstResultCol: String) {
+
+def hasExpectedMessage(exception: Throwable): Boolean =
--- End diff --

Since most other tests check parts of the message, I'm OK with this setup.  
When we don't think the message will remain stable, we can pass an empty string 
for expectedMessagePart.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173575838
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@

 package org.apache.spark.ml.feature

-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}


-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {

   import testImplicits._

-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
--- End diff --

done


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173575794
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala ---
@@ -19,61 +19,59 @@ package org.apache.spark.ml.feature

 import scala.beans.BeanInfo

-import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.util.DefaultReadWriteTest
-import org.apache.spark.mllib.util.MLlibTestSparkContext
-import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
+import org.apache.spark.sql.{DataFrame, Row}
+

 @BeanInfo
 case class NGramTestData(inputTokens: Array[String], wantedNGrams: Array[String])

-class NGramSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class NGramSuite extends MLTest with DefaultReadWriteTest {

-  import org.apache.spark.ml.feature.NGramSuite._
   import testImplicits._

   test("default behavior yields bigram features") {
 val nGram = new NGram()
   .setInputCol("inputTokens")
   .setOutputCol("nGrams")
-val dataset = Seq(NGramTestData(
+val dataFrame = Seq(NGramTestData(
--- End diff --

ok


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173555129
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@

 package org.apache.spark.ml.feature

-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}


-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {

   import testImplicits._

-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
-  @transient var l1Normalized: Array[Vector] = _
-  @transient var l2Normalized: Array[Vector] = _
+  @transient val data: Seq[Vector] = Seq(
+Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+Vectors.dense(0.0, 0.0, 0.0),
+Vectors.dense(0.6, -1.1, -3.0),
+Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
+Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
+Vectors.sparse(3, Seq()))
--- End diff --

I'd prefer to revert these changes.  As far as I know, nothing is broken, 
and this is a common pattern used in many parts of MLlib tests.

I think the main reason to move data around would be to have actual + 
expected values side-by-side for easier reading.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173554643
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala ---
@@ -19,61 +19,59 @@ package org.apache.spark.ml.feature

 import scala.beans.BeanInfo

-import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.util.DefaultReadWriteTest
-import org.apache.spark.mllib.util.MLlibTestSparkContext
-import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
+import org.apache.spark.sql.{DataFrame, Row}
+

 @BeanInfo
 case class NGramTestData(inputTokens: Array[String], wantedNGrams: Array[String])

-class NGramSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class NGramSuite extends MLTest with DefaultReadWriteTest {

-  import org.apache.spark.ml.feature.NGramSuite._
   import testImplicits._

   test("default behavior yields bigram features") {
 val nGram = new NGram()
   .setInputCol("inputTokens")
   .setOutputCol("nGrams")
-val dataset = Seq(NGramTestData(
+val dataFrame = Seq(NGramTestData(
--- End diff --

These kinds of changes are not necessary and make the PR a lot longer.  
Would you mind reverting them?


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r173556190
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@

 package org.apache.spark.ml.feature

-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}


-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {

   import testImplicits._

-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
--- End diff --

I will say, though, that I'm happy with moving Normalizer into individual 
tests.  It's weird how it is shared here since it's mutated within tests.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-06 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r172616958
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -324,19 +352,46 @@ class QuantileDiscretizerSuite
   .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3))
   .fit(df)

-val resultForMultiCols = plForMultiCols.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
-
-val resultForSingleCol = plForSingleCol.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
+val expected = Seq(
+  (0.0, 0.0, 0.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 3.0),
+  (0.0, 2.0, 4.0),
+  (0.0, 2.0, 4.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 7.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0)
+  ).toDF("result1", "result2", "result3")
+.collect().toSeq
--- End diff --

ok


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-05 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r172415192
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@

 package org.apache.spark.ml.feature

-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}


-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {

   import testImplicits._

-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
-  @transient var l1Normalized: Array[Vector] = _
-  @transient var l2Normalized: Array[Vector] = _
+  @transient val data: Seq[Vector] = Seq(
+Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+Vectors.dense(0.0, 0.0, 0.0),
+Vectors.dense(0.6, -1.1, -3.0),
+Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
+Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
+Vectors.sparse(3, Seq()))
--- End diff --

OK, it's a minor issue, let's ignore it.


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-05 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r172408255
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest {
   Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5))
 .toDF("id", "a", "b")
 val model = formula.fit(original)
+val attr = NominalAttribute.defaultAttr
 val expected = Seq(
 ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
 ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0),
 ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0),
 ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
 ).toDF("id", "a", "b", "features", "label")
-// assert(result.schema.toString == resultSchema.toString)
+  .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata()))
--- End diff --

I am also confused about the align rule. @jkbradley what do you think ? 


---




[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-05 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r172408009
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -324,19 +352,46 @@ class QuantileDiscretizerSuite
   .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3))
   .fit(df)

-val resultForMultiCols = plForMultiCols.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
-
-val resultForSingleCol = plForSingleCol.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
+val expected = Seq(
+  (0.0, 0.0, 0.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 3.0),
+  (0.0, 2.0, 4.0),
+  (0.0, 2.0, 4.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 7.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0)
+  ).toDF("result1", "result2", "result3")
+.collect().toSeq
--- End diff --

But I prefer to avoid hardcoding a big literal array, so that the code is easier to maintain. I think the following code is enough:
```
val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect()
testTransformerByGlobalCheckFunc[(Double, Double, Double)](
  df, plForSingleCol,
  "result1", "result2", "result3") { rows =>
  assert(rows == expected)
}
```
There is a similar case here:
https://github.com/apache/spark/pull/20121#discussion_r172288890


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-02 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171949807
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala ---
@@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { 
self: Suite =>
   otherResultCols: _*)(globalCheckFunction)
 testTransformerOnDF(dataframe, transformer, firstResultCol,
   otherResultCols: _*)(globalCheckFunction)
+}
+
+  def testTransformerByInterceptingException[A : Encoder](
+dataframe: DataFrame,
+transformer: Transformer,
+expectedMessagePart : String,
+firstResultCol: String) {
+
+def hasExpectedMessage(exception: Throwable): Boolean =
--- End diff --

It uses `contains`. I would keep this behaviour, as the test reads more 
expressively this way.
@jkbradley?
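
For reference, a minimal sketch of the `contains`-based check under discussion; the signature follows the diff above, but the body (message plus one level of cause) is an assumption, not necessarily the actual implementation:
```
// Sketch only: checks the expected fragment via contains rather than
// an exact match, looking at the message and one level of cause.
def hasExpectedMessage(exception: Throwable): Boolean =
  exception.getMessage.contains(expectedMessagePart) ||
    (exception.getCause != null &&
      exception.getCause.getMessage.contains(expectedMessagePart))
```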


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-02 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171948996
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala ---
@@ -63,13 +68,17 @@ class SQLTransformerSuite
   }
 
   test("SPARK-22538: SQLTransformer should not unpersist given dataset") {
-val df = spark.range(10)
+val df = spark.range(10).toDF()
 df.cache()
 df.count()
 assert(df.storageLevel != StorageLevel.NONE)
-new SQLTransformer()
+val sqlTrans = new SQLTransformer()
   .setStatement("SELECT id + 1 AS id1 FROM __THIS__")
-  .transform(df)
-assert(df.storageLevel != StorageLevel.NONE)
+testTransformerByGlobalCheckFunc[Long](
+  df,
+  sqlTrans,
+  "id1") { rows =>
+  assert(df.storageLevel != StorageLevel.NONE)
--- End diff --

Thanks, I will change it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-02 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171941603
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
   Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), 
("male", "baz", 5))
 .toDF("id", "a", "b")
 val model = formula.fit(original)
+val attr = NominalAttribute.defaultAttr
 val expected = Seq(
 ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
 ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0),
 ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0),
 ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
 ).toDF("id", "a", "b", "features", "label")
-// assert(result.schema.toString == resultSchema.toString)
+  .select($"id", $"a", $"b", $"features", $"label".as("label", 
attr.toMetadata()))
--- End diff --

I am sorry for spending time on this issue, but I would like to be 
consistent and keep the rules, so what about the following:

~~~
...
)
  .toDF("id", "a", "b", "features", "label")
  .select($"id", ...
~~~

So everything is indented by two spaces and the dots are aligned. Could you 
accept this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-02 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171938542
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -32,10 +31,20 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
   def testRFormulaTransform[A: Encoder](
   dataframe: DataFrame,
   formulaModel: RFormulaModel,
-  expected: DataFrame): Unit = {
+  expected: DataFrame,
+  expectedAttributes: AttributeGroup*): Unit = {
+val resultSchema = formulaModel.transformSchema(dataframe.schema)
+assert(resultSchema.json == expected.schema.json)
--- End diff --

I know they are different, but the `schema.json`-based comparison is more 
restrictive: it contains the metadata as well.
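
A minimal sketch of the difference, assuming Spark's `StructField.toString` omits the metadata field (which is exactly what makes the json compare the stricter one):
```
import org.apache.spark.sql.types._

val meta = new MetadataBuilder().putLong("num_attrs", 2L).build()
val plain = StructType(Seq(StructField("label", DoubleType, nullable = false)))
val withMeta = StructType(Seq(StructField("label", DoubleType, nullable = false, meta)))

// Assumption: toString drops the metadata, so these render identically...
assert(plain.toString == withMeta.toString)
// ...while the json form serializes it, so the difference is caught.
assert(plain.json != withMeta.json)
```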


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171790543
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
   Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), 
("male", "baz", 5))
 .toDF("id", "a", "b")
 val model = formula.fit(original)
+val attr = NominalAttribute.defaultAttr
 val expected = Seq(
 ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
 ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0),
 ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0),
 ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
 ).toDF("id", "a", "b", "features", "label")
-// assert(result.schema.toString == resultSchema.toString)
+  .select($"id", $"a", $"b", $"features", $"label".as("label", 
attr.toMetadata()))
--- End diff --

I think maybe
```
...
).toDF("id", "a", "b", "features", "label")
 .select($"id", ...
```
looks better to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171790083
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala ---
@@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { 
self: Suite =>
   otherResultCols: _*)(globalCheckFunction)
 testTransformerOnDF(dataframe, transformer, firstResultCol,
   otherResultCols: _*)(globalCheckFunction)
+}
+
+  def testTransformerByInterceptingException[A : Encoder](
+dataframe: DataFrame,
+transformer: Transformer,
+expectedMessagePart : String,
+firstResultCol: String) {
+
+def hasExpectedMessage(exception: Throwable): Boolean =
--- End diff --

I wonder whether the check here is too strict. It requires an exact match of 
the message, so when some class modifies its exception message, many test 
cases will fail.
Or can we just check the exception type?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171765261
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite
   .setInputCols(Array("size"))
   .setOutputCols(Array("encoded"))
 val model = encoder.fit(df)
-val output = model.transform(df)
-val group = AttributeGroup.fromStructField(output.schema("encoded"))
-assert(group.size === 2)
-assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
-assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { 
rows =>
+val group = 
AttributeGroup.fromStructField(rows.head.schema("encoded"))
+assert(group.size === 2)
+assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
+assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+}
--- End diff --

Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171765221
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@
 
 package org.apache.spark.ml.feature
 
-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}
 
 
-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext 
with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {
 
   import testImplicits._
 
-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
-  @transient var l1Normalized: Array[Vector] = _
-  @transient var l2Normalized: Array[Vector] = _
+  @transient val data: Seq[Vector] = Seq(
+Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+Vectors.dense(0.0, 0.0, 0.0),
+Vectors.dense(0.6, -1.1, -3.0),
+Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
+Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
+Vectors.sparse(3, Seq()))
--- End diff --

But `@transient` is about skipping serialisation for this field.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171764043
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
   Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), 
("male", "baz", 5))
 .toDF("id", "a", "b")
 val model = formula.fit(original)
+val attr = NominalAttribute.defaultAttr
 val expected = Seq(
 ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
 ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0),
 ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0),
 ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
 ).toDF("id", "a", "b", "features", "label")
-// assert(result.schema.toString == resultSchema.toString)
+  .select($"id", $"a", $"b", $"features", $"label".as("label", 
attr.toMetadata()))
--- End diff --

It was at the level of `val` plus 2 extra spaces. Should I align the dots in 
the same column?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171762299
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala ---
@@ -63,13 +68,17 @@ class SQLTransformerSuite
   }
 
   test("SPARK-22538: SQLTransformer should not unpersist given dataset") {
-val df = spark.range(10)
+val df = spark.range(10).toDF()
 df.cache()
 df.count()
 assert(df.storageLevel != StorageLevel.NONE)
-new SQLTransformer()
+val sqlTrans = new SQLTransformer()
   .setStatement("SELECT id + 1 AS id1 FROM __THIS__")
-  .transform(df)
-assert(df.storageLevel != StorageLevel.NONE)
+testTransformerByGlobalCheckFunc[Long](
+  df,
+  sqlTrans,
+  "id1") { rows =>
+  assert(df.storageLevel != StorageLevel.NONE)
--- End diff --

Moving `assert(df.storageLevel != StorageLevel.NONE)` here seems 
meaningless, because you do not use the `rows` parameter.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171760358
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -32,10 +31,20 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
   def testRFormulaTransform[A: Encoder](
   dataframe: DataFrame,
   formulaModel: RFormulaModel,
-  expected: DataFrame): Unit = {
+  expected: DataFrame,
+  expectedAttributes: AttributeGroup*): Unit = {
+val resultSchema = formulaModel.transformSchema(dataframe.schema)
+assert(resultSchema.json == expected.schema.json)
+assert(resultSchema == expected.schema)
 val (first +: rest) = expected.schema.fieldNames.toSeq
 val expectedRows = expected.collect()
 testTransformerByGlobalCheckFunc[A](dataframe, formulaModel, first, 
rest: _*) { rows =>
+  assert(rows.head.schema.toString() == resultSchema.toString())
+  for (expectedAttributeGroup <- expectedAttributes) {
+val attributeGroup =
+  
AttributeGroup.fromStructField(rows.head.schema(expectedAttributeGroup.name))
+assert(attributeGroup == expectedAttributeGroup)
--- End diff --

Should we use `===` instead?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171757214
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@
 
 package org.apache.spark.ml.feature
 
-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}
 
 
-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext 
with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {
 
   import testImplicits._
 
-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
-  @transient var l1Normalized: Array[Vector] = _
-  @transient var l2Normalized: Array[Vector] = _
+  @transient val data: Seq[Vector] = Seq(
+Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+Vectors.dense(0.0, 0.0, 0.0),
+Vectors.dense(0.6, -1.1, -3.0),
+Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
+Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
+Vectors.sparse(3, Seq()))
--- End diff --

My only doubt is that when the test suite object is serialized and then 
deserialized, the `data` will be lost. But I am not sure in which case 
serialization would occur.
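
If serialization ever does occur, a hedged sketch of the usual remedy (the `Holder` class below is made up for illustration): a `@transient lazy val` survives a round trip because it is recomputed on first access, whereas a plain `@transient val` comes back as null after deserialization.
```
// Made-up class, shown only to illustrate the @transient lazy val pattern.
class Holder extends Serializable {
  // Not written during serialization, but re-evaluated on first access
  // after deserialization, so the data is never lost.
  @transient lazy val data: Seq[Int] = Seq(1, 2, 3)
}
```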


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171757269
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite
   .setInputCols(Array("size"))
   .setOutputCols(Array("encoded"))
 val model = encoder.fit(df)
-val output = model.transform(df)
-val group = AttributeGroup.fromStructField(output.schema("encoded"))
-assert(group.size === 2)
-assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
-assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { 
rows =>
+val group = 
AttributeGroup.fromStructField(rows.head.schema("encoded"))
+assert(group.size === 2)
+assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
+assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+}
--- End diff --

I discussed this with @jkbradley. We agree with you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171762031
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -538,21 +540,28 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
 
 // Handle unseen labels.
 val formula2 = new RFormula().setFormula("b ~ a + id")
-intercept[SparkException] {
-  formula2.fit(df1).transform(df2).collect()
-}
+testTransformerByInterceptingException[(Int, String, String)](
+  df2,
+  formula2.fit(df1),
+  "Unseen label:",
+  "label")
+
 val model3 = formula2.setHandleInvalid("skip").fit(df1)
 val model4 = formula2.setHandleInvalid("keep").fit(df1)
 
+val attr = NominalAttribute.defaultAttr
 val expected3 = Seq(
   (1, "foo", "zq", Vectors.dense(0.0, 1.0), 0.0),
   (2, "bar", "zq", Vectors.dense(1.0, 2.0), 0.0)
 ).toDF("id", "a", "b", "features", "label")
+  .select($"id", $"a", $"b", $"features", $"label".as("label", 
attr.toMetadata()))
--- End diff --

nit: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171761570
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -32,10 +31,20 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
   def testRFormulaTransform[A: Encoder](
   dataframe: DataFrame,
   formulaModel: RFormulaModel,
-  expected: DataFrame): Unit = {
+  expected: DataFrame,
+  expectedAttributes: AttributeGroup*): Unit = {
+val resultSchema = formulaModel.transformSchema(dataframe.schema)
+assert(resultSchema.json == expected.schema.json)
--- End diff --

You compare `schema.json` instead of `schema.toString`. Are you sure they 
have the same effect?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171762037
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -538,21 +540,28 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
 
 // Handle unseen labels.
 val formula2 = new RFormula().setFormula("b ~ a + id")
-intercept[SparkException] {
-  formula2.fit(df1).transform(df2).collect()
-}
+testTransformerByInterceptingException[(Int, String, String)](
+  df2,
+  formula2.fit(df1),
+  "Unseen label:",
+  "label")
+
 val model3 = formula2.setHandleInvalid("skip").fit(df1)
 val model4 = formula2.setHandleInvalid("keep").fit(df1)
 
+val attr = NominalAttribute.defaultAttr
 val expected3 = Seq(
   (1, "foo", "zq", Vectors.dense(0.0, 1.0), 0.0),
   (2, "bar", "zq", Vectors.dense(1.0, 2.0), 0.0)
 ).toDF("id", "a", "b", "features", "label")
+  .select($"id", $"a", $"b", $"features", $"label".as("label", 
attr.toMetadata()))
+
 val expected4 = Seq(
   (1, "foo", "zq", Vectors.dense(0.0, 1.0, 1.0), 0.0),
   (2, "bar", "zq", Vectors.dense(1.0, 0.0, 2.0), 0.0),
   (3, "bar", "zy", Vectors.dense(1.0, 0.0, 3.0), 2.0)
 ).toDF("id", "a", "b", "features", "label")
+  .select($"id", $"a", $"b", $"features", $"label".as("label", 
attr.toMetadata()))
--- End diff --

nit: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171761966
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -381,31 +386,31 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
 NumericAttribute.defaultAttr)).toMetadata()
 val original = base.select(base.col("id"), base.col("vec").as("vec2", 
metadata))
 val model = formula.fit(original)
-val result = model.transform(original)
-val attrs = AttributeGroup.fromStructField(result.schema("features"))
+val expected = Seq(
+  (1, Vectors.dense(0.0, 1.0), Vectors.dense(0.0, 1.0), 1.0),
+  (2, Vectors.dense(1.0, 2.0), Vectors.dense(1.0, 2.0), 2.0)
+).toDF("id", "vec2", "features", "label")
+  .select($"id", $"vec2".as("vec2", metadata), $"features", $"label")
--- End diff --

nit: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171759634
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite
 
 val df = spark.createDataFrame(sc.parallelize(data), schema)
 
-val dfWithTypes = df
-  .withColumn("shortInput", df("input").cast(ShortType))
-  .withColumn("longInput", df("input").cast(LongType))
-  .withColumn("intInput", df("input").cast(IntegerType))
-  .withColumn("floatInput", df("input").cast(FloatType))
-  .withColumn("decimalInput", df("input").cast(DecimalType(10, 0)))
-
-val cols = Array("input", "shortInput", "longInput", "intInput",
-  "floatInput", "decimalInput")
-for (col <- cols) {
-  val encoder = new OneHotEncoderEstimator()
-.setInputCols(Array(col))
+class NumericTypeWithEncoder[A](val numericType: NumericType)
+  (implicit val encoder: Encoder[(A, Vector)])
+
+val types = Seq(
+  new NumericTypeWithEncoder[Short](ShortType),
+  new NumericTypeWithEncoder[Long](LongType),
+  new NumericTypeWithEncoder[Int](IntegerType),
+  new NumericTypeWithEncoder[Float](FloatType),
+  new NumericTypeWithEncoder[Byte](ByteType),
+  new NumericTypeWithEncoder[Double](DoubleType),
+  new NumericTypeWithEncoder[Decimal](DecimalType(10, 
0))(ExpressionEncoder()))
--- End diff --

Oh, I see. This is a syntax issue: `testTransformer` needs a generic type 
parameter. When I designed the `testTransformer` helper function, I could not 
eliminate the generic parameter, which makes things difficult.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171761941
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with 
DefaultReadWriteTest {
   Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), 
("male", "baz", 5))
 .toDF("id", "a", "b")
 val model = formula.fit(original)
+val attr = NominalAttribute.defaultAttr
 val expected = Seq(
 ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
 ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0),
 ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0),
 ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0)
 ).toDF("id", "a", "b", "features", "label")
-// assert(result.schema.toString == resultSchema.toString)
+  .select($"id", $"a", $"b", $"features", $"label".as("label", 
attr.toMetadata()))
--- End diff --

nit: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171657137
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala 
---
@@ -324,19 +352,46 @@ class QuantileDiscretizerSuite
   .setStages(Array(discretizerForCol1, discretizerForCol2, 
discretizerForCol3))
   .fit(df)
 
-val resultForMultiCols = plForMultiCols.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
-
-val resultForSingleCol = plForSingleCol.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
+val expected = Seq(
+  (0.0, 0.0, 0.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 3.0),
+  (0.0, 2.0, 4.0),
+  (0.0, 2.0, 4.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 7.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0)
+  ).toDF("result1", "result2", "result3")
+.collect().toSeq
--- End diff --

I think having a somewhat bigger array in the test is better than checking a 
df.transform result against another df.transform result (as the 
testTransformer function uses df.transform for the DF tests).



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171655452
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite
   .setInputCols(Array("size"))
   .setOutputCols(Array("encoded"))
 val model = encoder.fit(df)
-val output = model.transform(df)
-val group = AttributeGroup.fromStructField(output.schema("encoded"))
-assert(group.size === 2)
-assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
-assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { 
rows =>
+val group = 
AttributeGroup.fromStructField(rows.head.schema("encoded"))
+assert(group.size === 2)
+assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
+assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+}
--- End diff --

I am just wondering whether it is a good idea to revert the attribute tests, 
as they are working and they exercise streaming. Are there any disadvantages 
to keeping them? Could you please go into the details of why they are not 
needed?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171624983
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@
 
 package org.apache.spark.ml.feature
 
-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}
 
 
-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext 
with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {
 
   import testImplicits._
 
-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
-  @transient var l1Normalized: Array[Vector] = _
-  @transient var l2Normalized: Array[Vector] = _
+  @transient val data: Seq[Vector] = Seq(
+Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+Vectors.dense(0.0, 0.0, 0.0),
+Vectors.dense(0.6, -1.1, -3.0),
+Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
+Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
+Vectors.sparse(3, Seq()))
--- End diff --

First of all, thanks for the review. 

Regarding this specific comment: this way it can be a `val`, and the variable 
name and its value are close to each other. What is the advantage of 
separating them?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171615849
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite
 
 val df = spark.createDataFrame(sc.parallelize(data), schema)
 
-val dfWithTypes = df
-  .withColumn("shortInput", df("input").cast(ShortType))
-  .withColumn("longInput", df("input").cast(LongType))
-  .withColumn("intInput", df("input").cast(IntegerType))
-  .withColumn("floatInput", df("input").cast(FloatType))
-  .withColumn("decimalInput", df("input").cast(DecimalType(10, 0)))
-
-val cols = Array("input", "shortInput", "longInput", "intInput",
-  "floatInput", "decimalInput")
-for (col <- cols) {
-  val encoder = new OneHotEncoderEstimator()
-.setInputCols(Array(col))
+class NumericTypeWithEncoder[A](val numericType: NumericType)
+  (implicit val encoder: Encoder[(A, Vector)])
+
+val types = Seq(
+  new NumericTypeWithEncoder[Short](ShortType),
+  new NumericTypeWithEncoder[Long](LongType),
+  new NumericTypeWithEncoder[Int](IntegerType),
+  new NumericTypeWithEncoder[Float](FloatType),
+  new NumericTypeWithEncoder[Byte](ByteType),
+  new NumericTypeWithEncoder[Double](DoubleType),
+  new NumericTypeWithEncoder[Decimal](DecimalType(10, 
0))(ExpressionEncoder()))
+
+for (t <- types) {
+  val dfWithTypes = df.select(col("input").cast(t.numericType), 
col("expected"))
+  val estimator = new OneHotEncoderEstimator()
+.setInputCols(Array("input"))
 .setOutputCols(Array("output"))
 .setDropLast(false)
 
-  val model = encoder.fit(dfWithTypes)
-  val encoded = model.transform(dfWithTypes)
-
-  encoded.select("output", "expected").rdd.map { r =>
-(r.getAs[Vector](0), r.getAs[Vector](1))
-  }.collect().foreach { case (vec1, vec2) =>
-assert(vec1 === vec2)
-  }
+  val model = estimator.fit(dfWithTypes)
+  testTransformer(dfWithTypes, model, "output", "expected") {
+case Row(output: Vector, expected: Vector) =>
+  assert(output === expected)
+  }(t.encoder)
--- End diff --

See previous comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171615512
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite
 
 val df = spark.createDataFrame(sc.parallelize(data), schema)
 
-val dfWithTypes = df
-  .withColumn("shortInput", df("input").cast(ShortType))
-  .withColumn("longInput", df("input").cast(LongType))
-  .withColumn("intInput", df("input").cast(IntegerType))
-  .withColumn("floatInput", df("input").cast(FloatType))
-  .withColumn("decimalInput", df("input").cast(DecimalType(10, 0)))
-
-val cols = Array("input", "shortInput", "longInput", "intInput",
-  "floatInput", "decimalInput")
-for (col <- cols) {
-  val encoder = new OneHotEncoderEstimator()
-.setInputCols(Array(col))
+class NumericTypeWithEncoder[A](val numericType: NumericType)
+  (implicit val encoder: Encoder[(A, Vector)])
+
+val types = Seq(
+  new NumericTypeWithEncoder[Short](ShortType),
+  new NumericTypeWithEncoder[Long](LongType),
+  new NumericTypeWithEncoder[Int](IntegerType),
+  new NumericTypeWithEncoder[Float](FloatType),
+  new NumericTypeWithEncoder[Byte](ByteType),
+  new NumericTypeWithEncoder[Double](DoubleType),
+  new NumericTypeWithEncoder[Decimal](DecimalType(10, 
0))(ExpressionEncoder()))
--- End diff --

The reason behind this is that we cannot pass runtime values (ShortType, 
LongType, ...) as type arguments to the generic function testTransformer. But 
luckily [context bounds are resolved to an implicit 
parameter](https://docs.scala-lang.org/tutorials/FAQ/context-bounds.html#how-are-context-bounds-implemented):
 this is the `t.encoder`, which is passed as the last parameter.
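
A minimal sketch of that desugaring (the method names below are placeholders, not the real helpers), which is also why the encoder can be supplied explicitly as a trailing argument list:
```
import org.apache.spark.sql.{DataFrame, Encoder}

// A context bound on A...
def check[A: Encoder](df: DataFrame): Unit = ()

// ...is just sugar for an extra implicit parameter list,
// so the encoder can also be passed by hand:
def checkDesugared[A](df: DataFrame)(implicit encoder: Encoder[A]): Unit = ()

// e.g. checkDesugared[(Short, Vector)](dfWithTypes)(t.encoder)
```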


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171519201
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -116,11 +110,12 @@ class OneHotEncoderEstimatorSuite
   .setInputCols(Array("index"))
   .setOutputCols(Array("encoded"))
 val model = encoder.fit(df)
-val output = model.transform(df)
-val group = AttributeGroup.fromStructField(output.schema("encoded"))
-assert(group.size === 2)
-assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("0").withIndex(0))
-assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("1").withIndex(1))
+testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { 
rows =>
+  val group = 
AttributeGroup.fromStructField(rows.head.schema("encoded"))
+  assert(group.size === 2)
+  assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("0").withIndex(0))
+  assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("1").withIndex(1))
+}
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171546765
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala 
---
@@ -364,18 +419,47 @@ class QuantileDiscretizerSuite
   .setOutputCols(Array("result1", "result2", "result3"))
   .setNumBucketsArray(Array(10, 10, 10))
 
-val result1 = discretizerSingleNumBuckets.fit(df).transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
-val result2 = discretizerNumBucketsArray.fit(df).transform(df)
-  .select("result1", "result2", "result3")
+val expected = Seq(
+  (0.0, 0.0, 0.0),
+  (1.0, 1.0, 1.0),
+  (1.0, 1.0, 1.0),
+  (2.0, 2.0, 2.0),
+  (2.0, 2.0, 2.0),
+  (2.0, 2.0, 2.0),
+  (3.0, 3.0, 3.0),
+  (4.0, 4.0, 4.0),
+  (4.0, 4.0, 4.0),
+  (5.0, 5.0, 5.0),
+  (5.0, 5.0, 5.0),
+  (5.0, 5.0, 5.0),
+  (6.0, 6.0, 6.0),
+  (6.0, 6.0, 6.0),
+  (7.0, 7.0, 7.0),
+  (8.0, 8.0, 8.0),
+  (8.0, 8.0, 8.0),
+  (9.0, 9.0, 9.0),
+  (9.0, 9.0, 9.0),
+  (9.0, 9.0, 9.0)
+).toDF("result1", "result2", "result3")
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171546691
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala 
---
@@ -324,19 +352,46 @@ class QuantileDiscretizerSuite
   .setStages(Array(discretizerForCol1, discretizerForCol2, 
discretizerForCol3))
   .fit(df)
 
-val resultForMultiCols = plForMultiCols.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
-
-val resultForSingleCol = plForSingleCol.transform(df)
-  .select("result1", "result2", "result3")
-  .collect()
+val expected = Seq(
+  (0.0, 0.0, 0.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 0.0, 1.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 2.0),
+  (0.0, 1.0, 3.0),
+  (0.0, 2.0, 4.0),
+  (0.0, 2.0, 4.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 2.0, 5.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 6.0),
+  (1.0, 3.0, 7.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 8.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0),
+  (1.0, 4.0, 9.0)
+  ).toDF("result1", "result2", "result3")
+.collect().toSeq
--- End diff --

What about using:
```
val expected = plForSingleCol.transform(df)
  .select("result1", "result2", "result3")
  .collect()
```
so that we avoid hardcoding the big array?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171517362
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -17,94 +17,72 @@
 
 package org.apache.spark.ml.feature
 
-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Row}
 
 
-class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext 
with DefaultReadWriteTest {
+class NormalizerSuite extends MLTest with DefaultReadWriteTest {
 
   import testImplicits._
 
-  @transient var data: Array[Vector] = _
-  @transient var dataFrame: DataFrame = _
-  @transient var normalizer: Normalizer = _
-  @transient var l1Normalized: Array[Vector] = _
-  @transient var l2Normalized: Array[Vector] = _
+  @transient val data: Seq[Vector] = Seq(
+Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+Vectors.dense(0.0, 0.0, 0.0),
+Vectors.dense(0.6, -1.1, -3.0),
+Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
+Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))),
+Vectors.sparse(3, Seq()))
--- End diff --

I prefer to put the initialization of the `data` var into `beforeAll`.
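
A minimal sketch of that alternative, reusing the `@transient var` pattern the diff removes (vector values copied from the diff above, first two shown):
```
@transient var data: Seq[Vector] = _

override def beforeAll(): Unit = {
  super.beforeAll()
  // Initialized here instead of at the declaration site.
  data = Seq(
    Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
    Vectors.dense(0.0, 0.0, 0.0))
}
```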


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171543393
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite
 
 val df = spark.createDataFrame(sc.parallelize(data), schema)
 
-val dfWithTypes = df
-  .withColumn("shortInput", df("input").cast(ShortType))
-  .withColumn("longInput", df("input").cast(LongType))
-  .withColumn("intInput", df("input").cast(IntegerType))
-  .withColumn("floatInput", df("input").cast(FloatType))
-  .withColumn("decimalInput", df("input").cast(DecimalType(10, 0)))
-
-val cols = Array("input", "shortInput", "longInput", "intInput",
-  "floatInput", "decimalInput")
-for (col <- cols) {
-  val encoder = new OneHotEncoderEstimator()
-.setInputCols(Array(col))
+class NumericTypeWithEncoder[A](val numericType: NumericType)
+  (implicit val encoder: Encoder[(A, Vector)])
+
+val types = Seq(
+  new NumericTypeWithEncoder[Short](ShortType),
+  new NumericTypeWithEncoder[Long](LongType),
+  new NumericTypeWithEncoder[Int](IntegerType),
+  new NumericTypeWithEncoder[Float](FloatType),
+  new NumericTypeWithEncoder[Byte](ByteType),
+  new NumericTypeWithEncoder[Double](DoubleType),
+  new NumericTypeWithEncoder[Decimal](DecimalType(10, 
0))(ExpressionEncoder()))
+
+for (t <- types) {
+  val dfWithTypes = df.select(col("input").cast(t.numericType), 
col("expected"))
+  val estimator = new OneHotEncoderEstimator()
+.setInputCols(Array("input"))
 .setOutputCols(Array("output"))
 .setDropLast(false)
 
-  val model = encoder.fit(dfWithTypes)
-  val encoded = model.transform(dfWithTypes)
-
-  encoded.select("output", "expected").rdd.map { r =>
-(r.getAs[Vector](0), r.getAs[Vector](1))
-  }.collect().foreach { case (vec1, vec2) =>
-assert(vec1 === vec2)
-  }
+  val model = estimator.fit(dfWithTypes)
+  testTransformer(dfWithTypes, model, "output", "expected") {
+case Row(output: Vector, expected: Vector) =>
+  assert(output === expected)
+  }(t.encoder)
--- End diff --

Why do we need `(t.encoder)` here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171542822
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite
 
 val df = spark.createDataFrame(sc.parallelize(data), schema)
 
-val dfWithTypes = df
-  .withColumn("shortInput", df("input").cast(ShortType))
-  .withColumn("longInput", df("input").cast(LongType))
-  .withColumn("intInput", df("input").cast(IntegerType))
-  .withColumn("floatInput", df("input").cast(FloatType))
-  .withColumn("decimalInput", df("input").cast(DecimalType(10, 0)))
-
-val cols = Array("input", "shortInput", "longInput", "intInput",
-  "floatInput", "decimalInput")
-for (col <- cols) {
-  val encoder = new OneHotEncoderEstimator()
-.setInputCols(Array(col))
+class NumericTypeWithEncoder[A](val numericType: NumericType)
+  (implicit val encoder: Encoder[(A, Vector)])
+
+val types = Seq(
+  new NumericTypeWithEncoder[Short](ShortType),
+  new NumericTypeWithEncoder[Long](LongType),
+  new NumericTypeWithEncoder[Int](IntegerType),
+  new NumericTypeWithEncoder[Float](FloatType),
+  new NumericTypeWithEncoder[Byte](ByteType),
+  new NumericTypeWithEncoder[Double](DoubleType),
+  new NumericTypeWithEncoder[Decimal](DecimalType(10, 
0))(ExpressionEncoder()))
--- End diff --

Why not use `Seq(ShortType, LongType, ...)`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171519138
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala
 ---
@@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite
   .setInputCols(Array("size"))
   .setOutputCols(Array("encoded"))
 val model = encoder.fit(df)
-val output = model.transform(df)
-val group = AttributeGroup.fromStructField(output.schema("encoded"))
-assert(group.size === 2)
-assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
-assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { 
rows =>
+val group = 
AttributeGroup.fromStructField(rows.head.schema("encoded"))
+assert(group.size === 2)
+assert(group.getAttr(0) === 
BinaryAttribute.defaultAttr.withName("small").withIndex(0))
+assert(group.getAttr(1) === 
BinaryAttribute.defaultAttr.withName("medium").withIndex(1))
+}
--- End diff --

I think that for streaming we don't need to test the attribute-related 
functions, so this part can just keep the old testing code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-03-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20686#discussion_r171545178
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala ---
@@ -119,29 +131,41 @@ class OneHotEncoderSuite
 
   test("OneHotEncoder with varying types") {
 val df = stringIndexed()
-val dfWithTypes = df
-  .withColumn("shortLabel", df("labelIndex").cast(ShortType))
-  .withColumn("longLabel", df("labelIndex").cast(LongType))
-  .withColumn("intLabel", df("labelIndex").cast(IntegerType))
-  .withColumn("floatLabel", df("labelIndex").cast(FloatType))
-  .withColumn("decimalLabel", df("labelIndex").cast(DecimalType(10, 
0)))
-val cols = Array("labelIndex", "shortLabel", "longLabel", "intLabel",
-  "floatLabel", "decimalLabel")
-for (col <- cols) {
+val attr = NominalAttribute.defaultAttr.withValues("small", "medium", 
"large")
+val expected = Seq(
+  (0, Vectors.sparse(3, Seq((0, 1.0,
+  (1, Vectors.sparse(3, Seq((2, 1.0,
+  (2, Vectors.sparse(3, Seq((1, 1.0,
+  (3, Vectors.sparse(3, Seq((0, 1.0,
+  (4, Vectors.sparse(3, Seq((0, 1.0,
+  (5, Vectors.sparse(3, Seq((1, 1.0).toDF("id", "expected")
+
+val withExpected = df.join(expected, "id")
+
+class NumericTypeWithEncoder[A](val numericType: NumericType)
+   (implicit val encoder: Encoder[(A, Vector)])
+
+val types = Seq(
+  new NumericTypeWithEncoder[Short](ShortType),
+  new NumericTypeWithEncoder[Long](LongType),
+  new NumericTypeWithEncoder[Int](IntegerType),
+  new NumericTypeWithEncoder[Float](FloatType),
+  new NumericTypeWithEncoder[Byte](ByteType),
+  new NumericTypeWithEncoder[Double](DoubleType),
+  new NumericTypeWithEncoder[Decimal](DecimalType(10, 
0))(ExpressionEncoder()))
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...

2018-02-27 Thread attilapiros
GitHub user attilapiros opened a pull request:

https://github.com/apache/spark/pull/20686

[SPARK-22915][MLlib] Streaming tests for spark.ml.feature, from N to Z

# What changes were proposed in this pull request?

Adds structured streaming tests using testTransformer for these suites:

- NGramSuite.scala
- NormalizerSuite.scala
- OneHotEncoderEstimatorSuite.scala
- OneHotEncoderSuite.scala
- PCASuite.scala
- PolynomialExpansionSuite.scala
- QuantileDiscretizerSuite.scala
- RFormulaSuite.scala
- SQLTransformerSuite.scala
- StandardScalerSuite.scala
- StopWordsRemoverSuite.scala
- StringIndexerSuite.scala
- TokenizerSuite.scala
- VectorIndexerSuite.scala
- VectorSizeHintSuite.scala
- VectorSlicerSuite.scala
- Word2VecSuite.scala

# How was this patch tested?

They are unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/attilapiros/spark SPARK-22915

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20686.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20686


commit 4099c85e94fabdc4848acf54a9d2704d4f3f5246
Author: “attilapiros” 
Date:   2018-02-23T17:27:46Z

initial upload




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org