[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20686 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r174656617 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala --- @@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") -def validateResults(df: DataFrame): Unit = { - df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) => +def validateResults(rows: Seq[Row]): Unit = { + rows.foreach { case Row(vec1: Vector, vec2: Vector) => assert(vec1 === vec2) } - val resultMetadata = AttributeGroup.fromStructField(df.schema("result")) - val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected")) + val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result")) + val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected")) assert(resultMetadata.numAttributes === expectedMetadata.numAttributes) resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) => assert(a === b) } } vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty) -validateResults(vectorSlicer.transform(df)) +testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")( --- End diff -- I see, makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r174655509 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala --- @@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") -def validateResults(df: DataFrame): Unit = { - df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) => +def validateResults(rows: Seq[Row]): Unit = { + rows.foreach { case Row(vec1: Vector, vec2: Vector) => assert(vec1 === vec2) } - val resultMetadata = AttributeGroup.fromStructField(df.schema("result")) - val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected")) + val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result")) + val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected")) assert(resultMetadata.numAttributes === expectedMetadata.numAttributes) resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) => assert(a === b) } } vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty) -validateResults(vectorSlicer.transform(df)) +testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")( --- End diff -- ok. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r174535931 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -299,18 +310,17 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") -val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), - Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)), - Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)), - Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0))) +val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), --- End diff -- I agree that's correct. The problem is that people tend to see these patterns and copy them without thinking. It's best to follow patterns which help other contributors to avoid making mistakes. I'm OK with leaving it since this issue is scattered throughout MLlib tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r174247908 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala --- @@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") -def validateResults(df: DataFrame): Unit = { - df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) => +def validateResults(rows: Seq[Row]): Unit = { + rows.foreach { case Row(vec1: Vector, vec2: Vector) => assert(vec1 === vec2) } - val resultMetadata = AttributeGroup.fromStructField(df.schema("result")) - val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected")) + val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result")) + val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected")) assert(resultMetadata.numAttributes === expectedMetadata.numAttributes) resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) => assert(a === b) } } vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty) -validateResults(vectorSlicer.transform(df)) +testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")( --- End diff -- The reason I chose the global check function is the checks on the attributes: ``` val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result")) val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected")) assert(resultMetadata.numAttributes === expectedMetadata.numAttributes) resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) => assert(a === b) } ``` This part is not row-based but more like result-set based.
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r174245361 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala --- @@ -58,14 +57,16 @@ class VectorAssemblerSuite assert(v2.isInstanceOf[DenseVector]) } - test("VectorAssembler") { + ignore("VectorAssembler") { --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173999280 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -86,16 +94,19 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { } } - test("label column already exists but is not numeric type") { + ignore("label column already exists but is not numeric type") { --- End diff -- Thanks, this is a very good catch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173999125 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala --- @@ -58,14 +57,16 @@ class VectorAssemblerSuite assert(v2.isInstanceOf[DenseVector]) } - test("VectorAssembler") { + ignore("VectorAssembler") { --- End diff -- @attilapiros You need to revert the code here and keep the old `VectorAssembler` test suite. `VectorAssembler` does not support streaming mode unless you pipeline a `VectorSizeHint` before it.
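For reference, a minimal sketch of the workaround described above — pipelining a `VectorSizeHint` in front of a `VectorAssembler` so the assembler can work on streaming data. Column names here are illustrative, not taken from the suite:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// In streaming mode VectorAssembler must know each vector input's size up
// front; VectorSizeHint attaches that size to the column's ML metadata.
val sizeHint = new VectorSizeHint()
  .setInputCol("vecCol")
  .setSize(3)

val assembler = new VectorAssembler()
  .setInputCols(Array("vecCol", "numCol"))
  .setOutputCol("features")

// The hint stage runs first, so the assembler sees sized vector metadata.
val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
```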
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173998034 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -299,18 +310,17 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") -val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), - Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)), - Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)), - Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0))) +val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), --- End diff -- I confirmed this with @cloud-fan. If you use the pattern: ``` Seq(...).toDF().select(...).collect() ``` it will use a `localRelation` and always compute with a single partition, so the output row order stays the same as the input Seq. Many other test cases seem to use a similar approach.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173995919 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -299,18 +310,17 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") -val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), - Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)), - Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)), - Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0))) +val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), --- End diff -- I'd make the small counterargument that, without shuffling, DataFrame transformations do keep row ordering. :)
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173594072 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -299,18 +310,17 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") -val expected = Seq(Set((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), - Set((0, 2.0), (1, 2.0), (2, 0.0), (3, 1.0), (4, 1.0), (5, 2.0)), - Set((0, 1.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 1.0)), - Set((0, 1.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 1.0))) +val expected = Seq(Seq((0, 0.0), (1, 0.0), (2, 2.0), (3, 1.0), (4, 1.0), (5, 0.0)), --- End diff -- I'd keep using Set. The point is to be row-order-agnostic since DataFrames do not guarantee stable row ordering in general. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
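To make the trade-off concrete, here is a small pure-Scala sketch (the collected tuples are made up, not from the suite) of why a `Set` comparison survives row reordering while a `Seq` comparison does not:

```scala
// Pretend these tuples were collected from a DataFrame; without an
// explicit sort, the row order is not guaranteed in general.
val collected = Seq((1, 2.0), (0, 0.0), (2, 1.0))

val expectedSeq = Seq((0, 0.0), (1, 2.0), (2, 1.0))
val expectedSet = Set((0, 0.0), (1, 2.0), (2, 1.0))

// Order-sensitive: breaks as soon as rows come back in a different order.
assert(collected != expectedSeq)
// Order-agnostic: robust to any reordering of the same rows.
assert(collected.toSet == expectedSet)
```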
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173592896 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -109,16 +111,14 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") .fit(df) -val transformed = indexer.transform(df) -val attr = Attribute.fromStructField(transformed.schema("labelIndex")) - .asInstanceOf[NominalAttribute] -assert(attr.values.get === Array("100", "300", "200")) -val output = transformed.select("id", "labelIndex").rdd.map { r => - (r.getInt(0), r.getDouble(1)) -}.collect().toSet -// 100 -> 0, 200 -> 2, 300 -> 1 -val expected = Set((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)) -assert(output === expected) +testTransformerByGlobalCheckFunc[(Int, String)](df, indexer, "id", "labelIndex") { rows => + val attr = Attribute.fromStructField(rows.head.schema("labelIndex")) +.asInstanceOf[NominalAttribute] + assert(attr.values.get === Array("100", "300", "200")) + // 100 -> 0, 200 -> 2, 300 -> 1 + val expected = Seq((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)).toDF() --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173600685 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala --- @@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext MLTestingUtils.checkCopyAndUids(vectorIndexer, model) -model.transform(densePoints1) // should work -model.transform(sparsePoints1) // should work +// should work +testTransformer[FeatureData](densePoints1, model, "indexed") { _ => } +// should work +testTransformer[FeatureData](sparsePoints1, model, "indexed") { _ => } + // If the data is local Dataset, it throws AssertionError directly. -intercept[AssertionError] { - model.transform(densePoints2).collect() - logInfo("Did not throw error when fit, transform were called on vectors of different lengths") +withClue("Did not found expected error message when fit, " + + "transform were called on vectors of different lengths") { + testTransformerByInterceptingException[FeatureData]( +densePoints2, +model, +"VectorIndexerModel expected vector of length 3 but found length 4", +"indexed") } // If the data is distributed Dataset, it throws SparkException // which is the wrapper of AssertionError. -intercept[SparkException] { - model.transform(densePoints2.repartition(2)).collect() - logInfo("Did not throw error when fit, transform were called on vectors of different lengths") +withClue("Did not found expected error message when fit, " + --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173600416 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala --- @@ -58,14 +57,16 @@ class VectorAssemblerSuite assert(v2.isInstanceOf[DenseVector]) } - test("VectorAssembler") { + ignore("VectorAssembler") { --- End diff -- Don't ignore; just test on batch. This case is solved with VectorSizeHint. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173593795 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -247,14 +253,18 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") .fit(df) -val transformed = indexer.transform(df) +val expected1 = Seq(0.0, 2.0, 1.0, 0.0, 0.0, 1.0).map(Tuple1(_)).toDF("labelIndex") --- End diff -- Not needed; this isn't what this unit test is testing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173600331 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -86,16 +94,19 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { } } - test("label column already exists but is not numeric type") { + ignore("label column already exists but is not numeric type") { --- End diff -- testTransformerByInterceptingException should use ```(Int, Boolean)``` (not Double) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173584784 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -324,19 +352,24 @@ class QuantileDiscretizerSuite .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3)) .fit(df) -val resultForMultiCols = plForMultiCols.transform(df) - .select("result1", "result2", "result3") - .collect() +val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect() -val resultForSingleCol = plForSingleCol.transform(df) - .select("result1", "result2", "result3") - .collect() +testTransformerByGlobalCheckFunc[(Double, Double, Double)]( + df, + plForMultiCols, + "result1", + "result2", + "result3") { rows => +assert(rows === expected) + } -resultForSingleCol.zip(resultForMultiCols).foreach { - case (rowForSingle, rowForMultiCols) => -assert(rowForSingle.getDouble(0) == rowForMultiCols.getDouble(0) && - rowForSingle.getDouble(1) == rowForMultiCols.getDouble(1) && - rowForSingle.getDouble(2) == rowForMultiCols.getDouble(2)) +testTransformerByGlobalCheckFunc[(Double, Double, Double)]( --- End diff -- I'd remove this. Testing vs. multiCol is already testing batch vs streaming. No need to test singleCol against itself. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173582122 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala --- @@ -90,23 +96,29 @@ class OneHotEncoderSuite val encoder = new OneHotEncoder() .setInputCol("size") .setOutputCol("encoded") -val output = encoder.transform(df) -val group = AttributeGroup.fromStructField(output.schema("encoded")) -assert(group.size === 2) -assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) -assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +testTransformerByGlobalCheckFunc[(Double)](df, encoder, "encoded") { rows => + val group = AttributeGroup.fromStructField(rows.head.schema("encoded")) + assert(group.size === 2) + assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) + assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +} } - test("input column without ML attribute") { + + ignore("input column without ML attribute") { --- End diff -- Let's keep the test but limit it to batch. People should switch to OneHotEncoderEstimator anyways. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173593049 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -109,16 +111,14 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") .fit(df) -val transformed = indexer.transform(df) -val attr = Attribute.fromStructField(transformed.schema("labelIndex")) - .asInstanceOf[NominalAttribute] -assert(attr.values.get === Array("100", "300", "200")) -val output = transformed.select("id", "labelIndex").rdd.map { r => - (r.getInt(0), r.getDouble(1)) -}.collect().toSet -// 100 -> 0, 200 -> 2, 300 -> 1 -val expected = Set((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)) -assert(output === expected) +testTransformerByGlobalCheckFunc[(Int, String)](df, indexer, "id", "labelIndex") { rows => + val attr = Attribute.fromStructField(rows.head.schema("labelIndex")) +.asInstanceOf[NominalAttribute] + assert(attr.values.get === Array("100", "300", "200")) + // 100 -> 0, 200 -> 2, 300 -> 1 + val expected = Seq((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)).toDF() --- End diff -- And elsewhere below --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173600642 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala --- @@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext MLTestingUtils.checkCopyAndUids(vectorIndexer, model) -model.transform(densePoints1) // should work -model.transform(sparsePoints1) // should work +// should work +testTransformer[FeatureData](densePoints1, model, "indexed") { _ => } +// should work +testTransformer[FeatureData](sparsePoints1, model, "indexed") { _ => } + // If the data is local Dataset, it throws AssertionError directly. -intercept[AssertionError] { - model.transform(densePoints2).collect() - logInfo("Did not throw error when fit, transform were called on vectors of different lengths") +withClue("Did not found expected error message when fit, " + --- End diff -- Or just use the original text --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173587524 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5)) .toDF("id", "a", "b") val model = formula.fit(original) +val attr = NominalAttribute.defaultAttr val expected = Seq( ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0), ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0), ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0), ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0) ).toDF("id", "a", "b", "features", "label") -// assert(result.schema.toString == resultSchema.toString) + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- I'd just keep what you have now. There isn't a great solution here, and what you have fits other code examples in MLlib. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173592635 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -70,36 +71,37 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") .fit(df) + // Verify we throw by default with unseen values -intercept[SparkException] { - indexer.transform(df2).collect() -} +testTransformerByInterceptingException[(Int, String)]( + df2, + indexer, + "Unseen label:", + "labelIndex") indexer.setHandleInvalid("skip") -// Verify that we skip the c record -val transformedSkip = indexer.transform(df2) -val attrSkip = Attribute.fromStructField(transformedSkip.schema("labelIndex")) - .asInstanceOf[NominalAttribute] -assert(attrSkip.values.get === Array("b", "a")) -val outputSkip = transformedSkip.select("id", "labelIndex").rdd.map { r => - (r.getInt(0), r.getDouble(1)) -}.collect().toSet -// a -> 1, b -> 0 -val expectedSkip = Set((0, 1.0), (1, 0.0)) -assert(outputSkip === expectedSkip) + +testTransformerByGlobalCheckFunc[(Int, String)](df2, indexer, "id", "labelIndex") { rows => + val attrSkip = Attribute.fromStructField(rows.head.schema("labelIndex")) +.asInstanceOf[NominalAttribute] + assert(attrSkip.values.get === Array("b", "a")) + // Verify that we skip the c record + // a -> 1, b -> 0 + val expectedSkip = Seq((0, 1.0), (1, 0.0)).toDF() --- End diff -- This can be moved outside of the testTransformerByGlobalCheckFunc method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173594378 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -328,7 +338,12 @@ class StringIndexerSuite .setOutputCol("CITYIndexed") .fit(dfNoBristol) -val dfWithIndex = model.transform(dfNoBristol) -assert(dfWithIndex.filter($"CITYIndexed" === 1.0).count == 1) +testTransformerByGlobalCheckFunc[(String, String, String)]( + dfNoBristol, + model, + "CITYIndexed") { rows => + val transformed = rows.map { r => r.getDouble(0) }.toDF("CITYIndexed") --- End diff -- It's probably easier to avoid going through a DataFrame here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173600463 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala --- @@ -76,16 +77,18 @@ class VectorAssemblerSuite val assembler = new VectorAssembler() .setInputCols(Array("a", "b", "c")) .setOutputCol("features") -val thrown = intercept[IllegalArgumentException] { - assembler.transform(df) -} -assert(thrown.getMessage contains +testTransformerByInterceptingException[(String, String, String)]( + df, + assembler, "Data type StringType of column a is not supported.\n" + "Data type StringType of column b is not supported.\n" + - "Data type StringType of column c is not supported.") + "Data type StringType of column c is not supported.", + "features") } - test("ML attributes") { + ignore("ML attributes") { --- End diff -- ditto: do not ignore --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173876598 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorSlicerSuite.scala --- @@ -84,26 +84,29 @@ class VectorSlicerSuite extends SparkFunSuite with MLlibTestSparkContext with De val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result") -def validateResults(df: DataFrame): Unit = { - df.select("result", "expected").collect().foreach { case Row(vec1: Vector, vec2: Vector) => +def validateResults(rows: Seq[Row]): Unit = { + rows.foreach { case Row(vec1: Vector, vec2: Vector) => assert(vec1 === vec2) } - val resultMetadata = AttributeGroup.fromStructField(df.schema("result")) - val expectedMetadata = AttributeGroup.fromStructField(df.schema("expected")) + val resultMetadata = AttributeGroup.fromStructField(rows.head.schema("result")) + val expectedMetadata = AttributeGroup.fromStructField(rows.head.schema("expected")) assert(resultMetadata.numAttributes === expectedMetadata.numAttributes) resultMetadata.attributes.get.zip(expectedMetadata.attributes.get).foreach { case (a, b) => assert(a === b) } } vectorSlicer.setIndices(Array(1, 4)).setNames(Array.empty) -validateResults(vectorSlicer.transform(df)) +testTransformerByGlobalCheckFunc[(Vector, Vector)](df, vectorSlicer, "result", "expected")( --- End diff -- Avoid using a global check function when you don't need to. It'd be better to use testTransformer() since the test is per-row. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
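A sketch of the per-row variant being suggested, assuming `testTransformer` mirrors the `testTransformerByGlobalCheckFunc` signature shown in the diff but invokes the check function once per output row:

```scala
// Per-row check: each output row carries both vectors, so the equality
// assertion does not need to see the whole result set at once.
testTransformer[(Vector, Vector)](df, vectorSlicer, "result", "expected") {
  case Row(vec1: Vector, vec2: Vector) => assert(vec1 === vec2)
}
```

The metadata assertions on `rows.head.schema` would still need the global check function, since they inspect the result set as a whole rather than individual rows.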
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173593885 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -247,14 +253,18 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") .fit(df) -val transformed = indexer.transform(df) +val expected1 = Seq(0.0, 2.0, 1.0, 0.0, 0.0, 1.0).map(Tuple1(_)).toDF("labelIndex") +testTransformerByGlobalCheckFunc[(Int, String)](df, indexer, "labelIndex") { rows => + assert(rows == expected1.collect().seq) +} + val idx2str = new IndexToString() .setInputCol("labelIndex") .setOutputCol("sameLabel") .setLabels(indexer.labels) -idx2str.transform(transformed).select("label", "sameLabel").collect().foreach { - case Row(a: String, b: String) => -assert(a === b) + +testTransformerByGlobalCheckFunc[(Double)](expected1, idx2str, "sameLabel") { rows => --- End diff -- You should be able to test per-row, rather than using a global check function. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173592811 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -70,36 +71,37 @@ class StringIndexerSuite .setInputCol("label") .setOutputCol("labelIndex") .fit(df) + // Verify we throw by default with unseen values -intercept[SparkException] { - indexer.transform(df2).collect() -} +testTransformerByInterceptingException[(Int, String)]( + df2, + indexer, + "Unseen label:", + "labelIndex") indexer.setHandleInvalid("skip") -// Verify that we skip the c record -val transformedSkip = indexer.transform(df2) -val attrSkip = Attribute.fromStructField(transformedSkip.schema("labelIndex")) - .asInstanceOf[NominalAttribute] -assert(attrSkip.values.get === Array("b", "a")) -val outputSkip = transformedSkip.select("id", "labelIndex").rdd.map { r => - (r.getInt(0), r.getDouble(1)) -}.collect().toSet -// a -> 1, b -> 0 -val expectedSkip = Set((0, 1.0), (1, 0.0)) -assert(outputSkip === expectedSkip) + +testTransformerByGlobalCheckFunc[(Int, String)](df2, indexer, "id", "labelIndex") { rows => + val attrSkip = Attribute.fromStructField(rows.head.schema("labelIndex")) +.asInstanceOf[NominalAttribute] + assert(attrSkip.values.get === Array("b", "a")) + // Verify that we skip the c record + // a -> 1, b -> 0 + val expectedSkip = Seq((0, 1.0), (1, 0.0)).toDF() + assert(rows.seq === expectedSkip.collect().toSeq) +} indexer.setHandleInvalid("keep") + // Verify that we keep the unseen records -val transformedKeep = indexer.transform(df2) -val attrKeep = Attribute.fromStructField(transformedKeep.schema("labelIndex")) - .asInstanceOf[NominalAttribute] -assert(attrKeep.values.get === Array("b", "a", "__unknown")) -val outputKeep = transformedKeep.select("id", "labelIndex").rdd.map { r => - (r.getInt(0), r.getDouble(1)) -}.collect().toSet -// a -> 1, b -> 0, c -> 2, d -> 3 -val expectedKeep = Set((0, 1.0), (1, 0.0), (2, 2.0), (3, 2.0)) 
-assert(outputKeep === expectedKeep) +testTransformerByGlobalCheckFunc[(Int, String)](df2, indexer, "id", "labelIndex") { rows => + val attrKeep = Attribute.fromStructField(rows.head.schema("labelIndex")) +.asInstanceOf[NominalAttribute] + assert(attrKeep.values.get === Array("b", "a", "__unknown")) + // a -> 1, b -> 0, c -> 2, d -> 3 + val expectedKeep = Seq((0, 1.0), (1, 0.0), (2, 2.0), (3, 2.0)).toDF() --- End diff -- ditto: move outside checkFunc --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
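The "move outside checkFunc" request above can be illustrated without Spark. In `StreamTest`-style harnesses the check function may be invoked more than once (for the batch DataFrame and again per streaming output mode), so anything constructed inside the closure is rebuilt on every invocation. This is a standalone sketch with made-up names (`buildExpected`, `checkWithEveryMode`), not the actual `MLTest` code:

```scala
// Plain-Scala stand-in for a test harness that runs the same check function
// several times (e.g. once for batch, once per streaming output mode).
var builds = 0
def buildExpected(): Seq[(Int, Double)] = { builds += 1; Seq((0, 1.0), (1, 0.0)) }

def checkWithEveryMode(rows: Seq[(Int, Double)])(check: Seq[(Int, Double)] => Unit): Unit =
  (1 to 3).foreach(_ => check(rows)) // three invocations, like three execution modes

// Anti-pattern: expected result built inside the check function (3 builds).
checkWithEveryMode(Seq((0, 1.0), (1, 0.0))) { rows => assert(rows == buildExpected()) }
println(builds) // 3

// Preferred: build once outside, capture by reference (1 more build).
val expected = buildExpected()
checkWithEveryMode(Seq((0, 1.0), (1, 0.0))) { rows => assert(rows == expected) }
println(builds) // 4
```

With a real `toDF().collect()` inside the closure the waste is larger, since each invocation would trigger a Spark job rather than a cheap `Seq` construction.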
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173600557 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala --- @@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext MLTestingUtils.checkCopyAndUids(vectorIndexer, model) -model.transform(densePoints1) // should work -model.transform(sparsePoints1) // should work +// should work +testTransformer[FeatureData](densePoints1, model, "indexed") { _ => } +// should work +testTransformer[FeatureData](sparsePoints1, model, "indexed") { _ => } + // If the data is local Dataset, it throws AssertionError directly. -intercept[AssertionError] { - model.transform(densePoints2).collect() - logInfo("Did not throw error when fit, transform were called on vectors of different lengths") +withClue("Did not found expected error message when fit, " + --- End diff -- found -> find --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173584864 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -364,18 +397,26 @@ class QuantileDiscretizerSuite .setOutputCols(Array("result1", "result2", "result3")) .setNumBucketsArray(Array(10, 10, 10)) -val result1 = discretizerSingleNumBuckets.fit(df).transform(df) - .select("result1", "result2", "result3") - .collect() -val result2 = discretizerNumBucketsArray.fit(df).transform(df) - .select("result1", "result2", "result3") - .collect() - -result1.zip(result2).foreach { - case (row1, row2) => -assert(row1.getDouble(0) == row2.getDouble(0) && - row1.getDouble(1) == row2.getDouble(1) && - row1.getDouble(2) == row2.getDouble(2)) +val model = discretizerSingleNumBuckets.fit(df) +val expected = model.transform(df).select("result1", "result2", "result3").collect() + + +testTransformerByGlobalCheckFunc[(Double, Double, Double)]( + df, + model, + "result1", + "result2", + "result3") { rows => + assert(rows === expected) +} + +testTransformerByGlobalCheckFunc[(Double, Double, Double)]( --- End diff -- Is this a repeat of the test just above? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173600517 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala --- @@ -128,18 +126,29 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext MLTestingUtils.checkCopyAndUids(vectorIndexer, model) -model.transform(densePoints1) // should work -model.transform(sparsePoints1) // should work +// should work --- End diff -- We can remove "should work" comments : P --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173582018 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala --- @@ -90,23 +96,29 @@ class OneHotEncoderSuite val encoder = new OneHotEncoder() .setInputCol("size") .setOutputCol("encoded") -val output = encoder.transform(df) -val group = AttributeGroup.fromStructField(output.schema("encoded")) -assert(group.size === 2) -assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) -assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +testTransformerByGlobalCheckFunc[(Double)](df, encoder, "encoded") { rows => + val group = AttributeGroup.fromStructField(rows.head.schema("encoded")) + assert(group.size === 2) + assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) + assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +} } - test("input column without ML attribute") { + + ignore("input column without ML attribute") { --- End diff -- Let's keep the test but limit it to batch. People should switch to OneHotEncoderEstimator anyways. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173581428 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala --- @@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite => otherResultCols: _*)(globalCheckFunction) testTransformerOnDF(dataframe, transformer, firstResultCol, otherResultCols: _*)(globalCheckFunction) +} + + def testTransformerByInterceptingException[A : Encoder]( +dataframe: DataFrame, +transformer: Transformer, +expectedMessagePart : String, +firstResultCol: String) { + +def hasExpectedMessage(exception: Throwable): Boolean = --- End diff -- Yes, that was the reason. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173580629 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala --- @@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite => otherResultCols: _*)(globalCheckFunction) testTransformerOnDF(dataframe, transformer, firstResultCol, otherResultCols: _*)(globalCheckFunction) +} + + def testTransformerByInterceptingException[A : Encoder]( +dataframe: DataFrame, +transformer: Transformer, +expectedMessagePart : String, +firstResultCol: String) { + +def hasExpectedMessage(exception: Throwable): Boolean = --- End diff -- Just curious: Did you have to add the getCause case because of streaming throwing wrapped exceptions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173580016 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala --- @@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite => otherResultCols: _*)(globalCheckFunction) testTransformerOnDF(dataframe, transformer, firstResultCol, otherResultCols: _*)(globalCheckFunction) +} + + def testTransformerByInterceptingException[A : Encoder]( +dataframe: DataFrame, +transformer: Transformer, +expectedMessagePart : String, +firstResultCol: String) { + +def hasExpectedMessage(exception: Throwable): Boolean = --- End diff -- Since most other tests check parts of the message, I'm OK with this setup. When we don't think the message will remain stable, we can pass an empty string for expectedMessagePart. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
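The behavior debated in this exchange (matching with `contains` rather than equality, and walking `getCause` because streaming runners rethrow the original error wrapped in another exception) can be sketched in plain Scala. The helper name mirrors the one in the diff, but this standalone version is an illustration, not the actual `MLTest` implementation:

```scala
// A match succeeds if the expected fragment appears in the exception's own
// message, or in the message of any exception in its cause chain.
def hasExpectedMessage(exception: Throwable, expectedMessagePart: String): Boolean = {
  val directMatch =
    exception.getMessage != null && exception.getMessage.contains(expectedMessagePart)
  val causeMatch =
    exception.getCause != null && hasExpectedMessage(exception.getCause, expectedMessagePart)
  directMatch || causeMatch
}

// A wrapped exception, as a streaming query might surface it.
val wrapped = new RuntimeException("query failed",
  new IllegalArgumentException("Unseen label: c"))

println(hasExpectedMessage(wrapped, "Unseen label:")) // true: found in the cause
println(hasExpectedMessage(wrapped, "no such text"))  // false: nowhere in the chain
println(hasExpectedMessage(wrapped, ""))              // true: empty part matches anything
```

The last line shows the escape hatch jkbradley mentions: passing an empty string reduces the check to "some exception was thrown," for messages that are not expected to stay stable.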
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173575838 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ --- End diff -- done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173575794 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala --- @@ -19,61 +19,59 @@ package org.apache.spark.ml.feature import scala.beans.BeanInfo -import org.apache.spark.SparkFunSuite -import org.apache.spark.ml.util.DefaultReadWriteTest -import org.apache.spark.mllib.util.MLlibTestSparkContext -import org.apache.spark.sql.{Dataset, Row} +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} +import org.apache.spark.sql.{DataFrame, Row} + @BeanInfo case class NGramTestData(inputTokens: Array[String], wantedNGrams: Array[String]) -class NGramSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NGramSuite extends MLTest with DefaultReadWriteTest { - import org.apache.spark.ml.feature.NGramSuite._ import testImplicits._ test("default behavior yields bigram features") { val nGram = new NGram() .setInputCol("inputTokens") .setOutputCol("nGrams") -val dataset = Seq(NGramTestData( +val dataFrame = Seq(NGramTestData( --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173555129 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ - @transient var l1Normalized: Array[Vector] = _ - @transient var l2Normalized: Array[Vector] = _ + @transient val data: Seq[Vector] = Seq( +Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), +Vectors.dense(0.0, 0.0, 0.0), +Vectors.dense(0.6, -1.1, -3.0), +Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), +Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), +Vectors.sparse(3, Seq())) --- End diff -- I'd prefer to revert these changes. As far as I know, nothing is broken, and this is a common pattern used in many parts of MLlib tests. I think the main reason to move data around would be to have actual + expected values side-by-side for easier reading. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173554643 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala --- @@ -19,61 +19,59 @@ package org.apache.spark.ml.feature import scala.beans.BeanInfo -import org.apache.spark.SparkFunSuite -import org.apache.spark.ml.util.DefaultReadWriteTest -import org.apache.spark.mllib.util.MLlibTestSparkContext -import org.apache.spark.sql.{Dataset, Row} +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} +import org.apache.spark.sql.{DataFrame, Row} + @BeanInfo case class NGramTestData(inputTokens: Array[String], wantedNGrams: Array[String]) -class NGramSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NGramSuite extends MLTest with DefaultReadWriteTest { - import org.apache.spark.ml.feature.NGramSuite._ import testImplicits._ test("default behavior yields bigram features") { val nGram = new NGram() .setInputCol("inputTokens") .setOutputCol("nGrams") -val dataset = Seq(NGramTestData( +val dataFrame = Seq(NGramTestData( --- End diff -- These kinds of changes are not necessary and make the PR a lot longer. Would you mind reverting them? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r173556190 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ --- End diff -- I will say, though, that I'm happy with moving Normalizer into individual tests. It's weird how it is shared here since it's mutated within tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r172616958 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -324,19 +352,46 @@ class QuantileDiscretizerSuite .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3)) .fit(df) -val resultForMultiCols = plForMultiCols.transform(df) - .select("result1", "result2", "result3") - .collect() - -val resultForSingleCol = plForSingleCol.transform(df) - .select("result1", "result2", "result3") - .collect() +val expected = Seq( + (0.0, 0.0, 0.0), + (0.0, 0.0, 1.0), + (0.0, 0.0, 1.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 3.0), + (0.0, 2.0, 4.0), + (0.0, 2.0, 4.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 7.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0) + ).toDF("result1", "result2", "result3") +.collect().toSeq --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r172415192 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ - @transient var l1Normalized: Array[Vector] = _ - @transient var l2Normalized: Array[Vector] = _ + @transient val data: Seq[Vector] = Seq( +Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), +Vectors.dense(0.0, 0.0, 0.0), +Vectors.dense(0.6, -1.1, -3.0), +Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), +Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), +Vectors.sparse(3, Seq())) --- End diff -- ok its a minor issue lets ignore it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r172408255 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5)) .toDF("id", "a", "b") val model = formula.fit(original) +val attr = NominalAttribute.defaultAttr val expected = Seq( ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0), ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0), ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0), ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0) ).toDF("id", "a", "b", "features", "label") -// assert(result.schema.toString == resultSchema.toString) + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- I am also confused about the align rule. @jkbradley what do you think ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r172408009 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -324,19 +352,46 @@ class QuantileDiscretizerSuite .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3)) .fit(df) -val resultForMultiCols = plForMultiCols.transform(df) - .select("result1", "result2", "result3") - .collect() - -val resultForSingleCol = plForSingleCol.transform(df) - .select("result1", "result2", "result3") - .collect() +val expected = Seq( + (0.0, 0.0, 0.0), + (0.0, 0.0, 1.0), + (0.0, 0.0, 1.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 3.0), + (0.0, 2.0, 4.0), + (0.0, 2.0, 4.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 7.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0) + ).toDF("result1", "result2", "result3") +.collect().toSeq --- End diff -- But I prefer to avoid hardcoding big literal array so that the code is easier for maintenance. and following code is enough I think: ``` val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect() testTransformerByGlobalCheckFunc[(Double, Double, Double)]( df,plForSingleCol, "result1", "result2","result3") { rows =>assert(rows == expected) } ``` There is a similar case here https://github.com/apache/spark/pull/20121#discussion_r172288890 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171949807 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala --- @@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite => otherResultCols: _*)(globalCheckFunction) testTransformerOnDF(dataframe, transformer, firstResultCol, otherResultCols: _*)(globalCheckFunction) +} + + def testTransformerByInterceptingException[A : Encoder]( +dataframe: DataFrame, +transformer: Transformer, +expectedMessagePart : String, +firstResultCol: String) { + +def hasExpectedMessage(exception: Throwable): Boolean = --- End diff -- It uses contains. I would keep this behaviour as the test is more well spoken this way. @jkbradley? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171948996 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala --- @@ -63,13 +68,17 @@ class SQLTransformerSuite } test("SPARK-22538: SQLTransformer should not unpersist given dataset") { -val df = spark.range(10) +val df = spark.range(10).toDF() df.cache() df.count() assert(df.storageLevel != StorageLevel.NONE) -new SQLTransformer() +val sqlTrans = new SQLTransformer() .setStatement("SELECT id + 1 AS id1 FROM __THIS__") - .transform(df) -assert(df.storageLevel != StorageLevel.NONE) +testTransformerByGlobalCheckFunc[Long]( + df, + sqlTrans, + "id1") { rows => + assert(df.storageLevel != StorageLevel.NONE) --- End diff -- Thanks, I change it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171941603 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5)) .toDF("id", "a", "b") val model = formula.fit(original) +val attr = NominalAttribute.defaultAttr val expected = Seq( ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0), ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0), ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0), ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0) ).toDF("id", "a", "b", "features", "label") -// assert(result.schema.toString == resultSchema.toString) + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- I am sorry to spending time with this issue but I would like to be consistent and keep the rules so what about the following: ~~~ ... ) .toDF("id", "a", "b", "features", "label") .select($"id", ... ~~~ So all indented by two spaces and the dots are aligned. Could you accept this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171938542 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -32,10 +31,20 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { def testRFormulaTransform[A: Encoder]( dataframe: DataFrame, formulaModel: RFormulaModel, - expected: DataFrame): Unit = { + expected: DataFrame, + expectedAttributes: AttributeGroup*): Unit = { +val resultSchema = formulaModel.transformSchema(dataframe.schema) +assert(resultSchema.json == expected.schema.json) --- End diff -- I know they are different but 'schema.json' based compare is more restrictive: it contains the metadata as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171790543 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5)) .toDF("id", "a", "b") val model = formula.fit(original) +val attr = NominalAttribute.defaultAttr val expected = Seq( ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0), ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0), ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0), ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0) ).toDF("id", "a", "b", "features", "label") -// assert(result.schema.toString == resultSchema.toString) + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- I think maybe ``` ... ).toDF("id", "a", "b", "features", "label") .select($"id", ... ``` looks beautiful. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171790083 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala --- @@ -108,5 +111,29 @@ trait MLTest extends StreamTest with TempDirectory { self: Suite => otherResultCols: _*)(globalCheckFunction) testTransformerOnDF(dataframe, transformer, firstResultCol, otherResultCols: _*)(globalCheckFunction) +} + + def testTransformerByInterceptingException[A : Encoder]( +dataframe: DataFrame, +transformer: Transformer, +expectedMessagePart : String, +firstResultCol: String) { + +def hasExpectedMessage(exception: Throwable): Boolean = --- End diff -- I doubt whether the check here is too strict. It require exactly match message so when some class modify the exception message then many testcase will fail. Or can we just check the exception type ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171765261 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite .setInputCols(Array("size")) .setOutputCols(Array("encoded")) val model = encoder.fit(df) -val output = model.transform(df) -val group = AttributeGroup.fromStructField(output.schema("encoded")) -assert(group.size === 2) -assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) -assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { rows => +val group = AttributeGroup.fromStructField(rows.head.schema("encoded")) +assert(group.size === 2) +assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) +assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +} --- End diff -- Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171765221 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ - @transient var l1Normalized: Array[Vector] = _ - @transient var l2Normalized: Array[Vector] = _ + @transient val data: Seq[Vector] = Seq( +Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), +Vectors.dense(0.0, 0.0, 0.0), +Vectors.dense(0.6, -1.1, -3.0), +Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), +Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), +Vectors.sparse(3, Seq())) --- End diff -- But '@transient' is about to skipping serialisation for this field --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
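The `@transient` point raised here can be demonstrated without Spark: the annotation marks the underlying JVM field `transient`, so the value is present on the live object but dropped by Java serialization (which is why suite-level test data in Spark suites is conventionally marked `@transient` — the driver-side suite object may be serialized, and the field would otherwise be shipped along). The class name below is hypothetical:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// A serializable holder with one @transient field and one ordinary field.
class SuiteData(@transient val data: Seq[Int], val name: String) extends Serializable

// Round-trip an instance through Java serialization.
val buffer = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(buffer)
oos.writeObject(new SuiteData(Seq(1, 2, 3), "points"))
oos.close()

val restored = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  .readObject().asInstanceOf[SuiteData]

println(restored.name)         // "points": the ordinary field survives
println(restored.data == null) // true: the @transient field was not serialized
```

So converting the suite fields from `var`s assigned in `beforeAll` to `@transient val`s keeps the same serialization behavior while letting the data be defined inline.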
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171764043 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5)) .toDF("id", "a", "b") val model = formula.fit(original) +val attr = NominalAttribute.defaultAttr val expected = Seq( ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0), ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0), ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0), ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0) ).toDF("id", "a", "b", "features", "label") -// assert(result.schema.toString == resultSchema.toString) + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- It was at the level of val +2 extra spaces. Should I indent the dots to the same row? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171762299 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala --- @@ -63,13 +68,17 @@ class SQLTransformerSuite } test("SPARK-22538: SQLTransformer should not unpersist given dataset") { -val df = spark.range(10) +val df = spark.range(10).toDF() df.cache() df.count() assert(df.storageLevel != StorageLevel.NONE) -new SQLTransformer() +val sqlTrans = new SQLTransformer() .setStatement("SELECT id + 1 AS id1 FROM __THIS__") - .transform(df) -assert(df.storageLevel != StorageLevel.NONE) +testTransformerByGlobalCheckFunc[Long]( + df, + sqlTrans, + "id1") { rows => + assert(df.storageLevel != StorageLevel.NONE) --- End diff -- Moving `assert(df.storageLevel != StorageLevel.NONE)` here seems meaningless, because you do not use the `rows` parameter.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171760358 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -32,10 +31,20 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { def testRFormulaTransform[A: Encoder]( dataframe: DataFrame, formulaModel: RFormulaModel, - expected: DataFrame): Unit = { + expected: DataFrame, + expectedAttributes: AttributeGroup*): Unit = { +val resultSchema = formulaModel.transformSchema(dataframe.schema) +assert(resultSchema.json == expected.schema.json) +assert(resultSchema == expected.schema) val (first +: rest) = expected.schema.fieldNames.toSeq val expectedRows = expected.collect() testTransformerByGlobalCheckFunc[A](dataframe, formulaModel, first, rest: _*) { rows => + assert(rows.head.schema.toString() == resultSchema.toString()) + for (expectedAttributeGroup <- expectedAttributes) { +val attributeGroup = + AttributeGroup.fromStructField(rows.head.schema(expectedAttributeGroup.name)) +assert(attributeGroup == expectedAttributeGroup) --- End diff -- Should we use `===` instead ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
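The suggestion to prefer `===` over `==` in these suites can be illustrated outside Spark: ScalaTest's `===` (from `Assertions`, via Scalactic's `TripleEquals`) resolves an implicit `Equality[A]`, so tests can plug in a tolerant comparison and get failure messages that include both operands, whereas `==` is always strict universal equality. This is a minimal sketch assuming ScalaTest/Scalactic on the classpath; it is illustrative of the convention, not of Spark's own `TestingUtils` helpers.

```scala
import org.scalactic.TolerantNumerics
import org.scalatest.Assertions._

object TripleEqualsDemo {
  def main(args: Array[String]): Unit = {
    // `===` resolves an implicit Equality[Double], so a tolerant comparison
    // can be swapped in for the whole suite; `==` stays bit-for-bit strict.
    implicit val doubleEq = TolerantNumerics.tolerantDoubleEquality(0.01)
    assert(0.1 + 0.2 === 0.3)   // passes under the tolerant Equality
    assert(!(0.1 + 0.2 == 0.3)) // strict == still sees the floating-point error
    println("ok")
  }
}
```

On failure, `assert(a === b)` also reports both sides ("x did not equal y") instead of a bare "assertion failed", which is the other reason review comments push toward it.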
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171757214 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ - @transient var l1Normalized: Array[Vector] = _ - @transient var l2Normalized: Array[Vector] = _ + @transient val data: Seq[Vector] = Seq( +Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), +Vectors.dense(0.0, 0.0, 0.0), +Vectors.dense(0.6, -1.1, -3.0), +Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), +Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), +Vectors.sparse(3, Seq())) --- End diff -- My only doubt is that when the test suite object is serialized and then deserialized, the `data` will be lost. But I am not sure in which cases serialization will occur.
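The concern about a `@transient val` can be checked directly with plain JVM serialization, no Spark required: a transient field is skipped when the object is written, and comes back as `null` after deserialization. The `Holder` class below is a hypothetical stand-in for a test suite holding test data, not code from the PR.

```scala
import java.io._

// Hypothetical stand-in for a serializable test suite holding test data.
class Holder extends Serializable {
  @transient val data: Seq[Int] = Seq(1, 2, 3) // skipped by serialization
  val kept: Seq[Int] = Seq(1, 2, 3)            // serialized normally
}

object TransientDemo {
  // Serialize an object to bytes and read it back.
  def roundTrip[A <: Serializable](a: A): A = {
    val buf = new ByteArrayOutputStream()
    new ObjectOutputStream(buf).writeObject(a)
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[A]
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Holder)
    println(copy.data) // null: the @transient val is not restored
    println(copy.kept) // List(1, 2, 3)
  }
}
```

This is exactly the doubt raised above: if the suite object is ever serialized (for example, captured in a closure shipped to executors), a `@transient val data` would be `null` on the remote side, which is why initializing a `var` in `beforeAll` was the older pattern.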
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171757269 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite .setInputCols(Array("size")) .setOutputCols(Array("encoded")) val model = encoder.fit(df) -val output = model.transform(df) -val group = AttributeGroup.fromStructField(output.schema("encoded")) -assert(group.size === 2) -assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) -assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { rows => +val group = AttributeGroup.fromStructField(rows.head.schema("encoded")) +assert(group.size === 2) +assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) +assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +} --- End diff -- Discussed with @jkbradley . Agreed with you. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171762031 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -538,21 +540,28 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { // Handle unseen labels. val formula2 = new RFormula().setFormula("b ~ a + id") -intercept[SparkException] { - formula2.fit(df1).transform(df2).collect() -} +testTransformerByInterceptingException[(Int, String, String)]( + df2, + formula2.fit(df1), + "Unseen label:", + "label") + val model3 = formula2.setHandleInvalid("skip").fit(df1) val model4 = formula2.setHandleInvalid("keep").fit(df1) +val attr = NominalAttribute.defaultAttr val expected3 = Seq( (1, "foo", "zq", Vectors.dense(0.0, 1.0), 0.0), (2, "bar", "zq", Vectors.dense(1.0, 2.0), 0.0) ).toDF("id", "a", "b", "features", "label") + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- nit: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171761570 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -32,10 +31,20 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { def testRFormulaTransform[A: Encoder]( dataframe: DataFrame, formulaModel: RFormulaModel, - expected: DataFrame): Unit = { + expected: DataFrame, + expectedAttributes: AttributeGroup*): Unit = { +val resultSchema = formulaModel.transformSchema(dataframe.schema) +assert(resultSchema.json == expected.schema.json) --- End diff -- You compare `schema.json` instead of `schema.toString`. Are you sure they have the same effect ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171762037 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -538,21 +540,28 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { // Handle unseen labels. val formula2 = new RFormula().setFormula("b ~ a + id") -intercept[SparkException] { - formula2.fit(df1).transform(df2).collect() -} +testTransformerByInterceptingException[(Int, String, String)]( + df2, + formula2.fit(df1), + "Unseen label:", + "label") + val model3 = formula2.setHandleInvalid("skip").fit(df1) val model4 = formula2.setHandleInvalid("keep").fit(df1) +val attr = NominalAttribute.defaultAttr val expected3 = Seq( (1, "foo", "zq", Vectors.dense(0.0, 1.0), 0.0), (2, "bar", "zq", Vectors.dense(1.0, 2.0), 0.0) ).toDF("id", "a", "b", "features", "label") + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) + val expected4 = Seq( (1, "foo", "zq", Vectors.dense(0.0, 1.0, 1.0), 0.0), (2, "bar", "zq", Vectors.dense(1.0, 0.0, 2.0), 0.0), (3, "bar", "zy", Vectors.dense(1.0, 0.0, 3.0), 2.0) ).toDF("id", "a", "b", "features", "label") + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- nit: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171761966 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -381,31 +386,31 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { NumericAttribute.defaultAttr)).toMetadata() val original = base.select(base.col("id"), base.col("vec").as("vec2", metadata)) val model = formula.fit(original) -val result = model.transform(original) -val attrs = AttributeGroup.fromStructField(result.schema("features")) +val expected = Seq( + (1, Vectors.dense(0.0, 1.0), Vectors.dense(0.0, 1.0), 1.0), + (2, Vectors.dense(1.0, 2.0), Vectors.dense(1.0, 2.0), 2.0) +).toDF("id", "vec2", "features", "label") + .select($"id", $"vec2".as("vec2", metadata), $"features", $"label") --- End diff -- nit: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171759634 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite val df = spark.createDataFrame(sc.parallelize(data), schema) -val dfWithTypes = df - .withColumn("shortInput", df("input").cast(ShortType)) - .withColumn("longInput", df("input").cast(LongType)) - .withColumn("intInput", df("input").cast(IntegerType)) - .withColumn("floatInput", df("input").cast(FloatType)) - .withColumn("decimalInput", df("input").cast(DecimalType(10, 0))) - -val cols = Array("input", "shortInput", "longInput", "intInput", - "floatInput", "decimalInput") -for (col <- cols) { - val encoder = new OneHotEncoderEstimator() -.setInputCols(Array(col)) +class NumericTypeWithEncoder[A](val numericType: NumericType) + (implicit val encoder: Encoder[(A, Vector)]) + +val types = Seq( + new NumericTypeWithEncoder[Short](ShortType), + new NumericTypeWithEncoder[Long](LongType), + new NumericTypeWithEncoder[Int](IntegerType), + new NumericTypeWithEncoder[Float](FloatType), + new NumericTypeWithEncoder[Byte](ByteType), + new NumericTypeWithEncoder[Double](DoubleType), + new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder())) --- End diff -- Oh I see. This is a syntax issue: `testTransformer` needs a generic parameter. When I designed the `testTransformer` helper function, I could not eliminate the generic parameter, which makes things difficult.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171761941 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -313,13 +306,14 @@ class RFormulaSuite extends MLTest with DefaultReadWriteTest { Seq(("male", "foo", 4), ("female", "bar", 4), ("female", "bar", 5), ("male", "baz", 5)) .toDF("id", "a", "b") val model = formula.fit(original) +val attr = NominalAttribute.defaultAttr val expected = Seq( ("male", "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0), ("female", "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 0.0), ("female", "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 0.0), ("male", "baz", 5, Vectors.dense(0.0, 0.0, 5.0), 1.0) ).toDF("id", "a", "b", "features", "label") -// assert(result.schema.toString == resultSchema.toString) + .select($"id", $"a", $"b", $"features", $"label".as("label", attr.toMetadata())) --- End diff -- nit: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171657137 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -324,19 +352,46 @@ class QuantileDiscretizerSuite .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3)) .fit(df) -val resultForMultiCols = plForMultiCols.transform(df) - .select("result1", "result2", "result3") - .collect() - -val resultForSingleCol = plForSingleCol.transform(df) - .select("result1", "result2", "result3") - .collect() +val expected = Seq( + (0.0, 0.0, 0.0), + (0.0, 0.0, 1.0), + (0.0, 0.0, 1.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 3.0), + (0.0, 2.0, 4.0), + (0.0, 2.0, 4.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 7.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0) + ).toDF("result1", "result2", "result3") +.collect().toSeq --- End diff -- I think having a somewhat bigger array in the test is better than checking one df.transform result against another df.transform result (as the testTransformer function itself uses df.transform for the DataFrame tests).
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171655452 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite .setInputCols(Array("size")) .setOutputCols(Array("encoded")) val model = encoder.fit(df) -val output = model.transform(df) -val group = AttributeGroup.fromStructField(output.schema("encoded")) -assert(group.size === 2) -assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) -assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { rows => +val group = AttributeGroup.fromStructField(rows.head.schema("encoded")) +assert(group.size === 2) +assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) +assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +} --- End diff -- I am just wondering whether it is a good idea to revert attribute tests as they are working and checking streaming. Is there any disadvantages keeping them? Can you please go into the details why they are not needed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171624983 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ - @transient var l1Normalized: Array[Vector] = _ - @transient var l2Normalized: Array[Vector] = _ + @transient val data: Seq[Vector] = Seq( +Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), +Vectors.dense(0.0, 0.0, 0.0), +Vectors.dense(0.6, -1.1, -3.0), +Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), +Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), +Vectors.sparse(3, Seq())) --- End diff -- First of all thanks for the review. Regarding this specific comment: this way it can be a 'val' and variable name and value is close to each other. What is the advantage of separating them? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171615849 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite val df = spark.createDataFrame(sc.parallelize(data), schema) -val dfWithTypes = df - .withColumn("shortInput", df("input").cast(ShortType)) - .withColumn("longInput", df("input").cast(LongType)) - .withColumn("intInput", df("input").cast(IntegerType)) - .withColumn("floatInput", df("input").cast(FloatType)) - .withColumn("decimalInput", df("input").cast(DecimalType(10, 0))) - -val cols = Array("input", "shortInput", "longInput", "intInput", - "floatInput", "decimalInput") -for (col <- cols) { - val encoder = new OneHotEncoderEstimator() -.setInputCols(Array(col)) +class NumericTypeWithEncoder[A](val numericType: NumericType) + (implicit val encoder: Encoder[(A, Vector)]) + +val types = Seq( + new NumericTypeWithEncoder[Short](ShortType), + new NumericTypeWithEncoder[Long](LongType), + new NumericTypeWithEncoder[Int](IntegerType), + new NumericTypeWithEncoder[Float](FloatType), + new NumericTypeWithEncoder[Byte](ByteType), + new NumericTypeWithEncoder[Double](DoubleType), + new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder())) + +for (t <- types) { + val dfWithTypes = df.select(col("input").cast(t.numericType), col("expected")) + val estimator = new OneHotEncoderEstimator() +.setInputCols(Array("input")) .setOutputCols(Array("output")) .setDropLast(false) - val model = encoder.fit(dfWithTypes) - val encoded = model.transform(dfWithTypes) - - encoded.select("output", "expected").rdd.map { r => -(r.getAs[Vector](0), r.getAs[Vector](1)) - }.collect().foreach { case (vec1, vec2) => -assert(vec1 === vec2) - } + val model = estimator.fit(dfWithTypes) + testTransformer(dfWithTypes, model, "output", "expected") { +case Row(output: Vector, expected: Vector) => + assert(output === expected) + }(t.encoder) --- End diff -- See previous comment.
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171615512 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite val df = spark.createDataFrame(sc.parallelize(data), schema) -val dfWithTypes = df - .withColumn("shortInput", df("input").cast(ShortType)) - .withColumn("longInput", df("input").cast(LongType)) - .withColumn("intInput", df("input").cast(IntegerType)) - .withColumn("floatInput", df("input").cast(FloatType)) - .withColumn("decimalInput", df("input").cast(DecimalType(10, 0))) - -val cols = Array("input", "shortInput", "longInput", "intInput", - "floatInput", "decimalInput") -for (col <- cols) { - val encoder = new OneHotEncoderEstimator() -.setInputCols(Array(col)) +class NumericTypeWithEncoder[A](val numericType: NumericType) + (implicit val encoder: Encoder[(A, Vector)]) + +val types = Seq( + new NumericTypeWithEncoder[Short](ShortType), + new NumericTypeWithEncoder[Long](LongType), + new NumericTypeWithEncoder[Int](IntegerType), + new NumericTypeWithEncoder[Float](FloatType), + new NumericTypeWithEncoder[Byte](ByteType), + new NumericTypeWithEncoder[Double](DoubleType), + new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder())) --- End diff -- The reason is that we cannot pass runtime values (ShortType, LongType, ...) to the generic function testTransformer. But luckily [context bounds are resolved to an implicit parameter](https://docs.scala-lang.org/tutorials/FAQ/context-bounds.html#how-are-context-bounds-implemented): this is the `t.encoder`, which is passed as the last parameter.
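The context-bound mechanism described above can be shown in a few lines of plain Scala. The `Encoder` trait and function names here are illustrative stand-ins, not the Spark test helpers: a `[A: Encoder]` bound desugars to an extra implicit parameter list, which is why the diff can pass `t.encoder` explicitly after the closure, as in `}(t.encoder)`.

```scala
// Illustrative stand-in for an encoder type class (not Spark's Encoder).
trait Encoder[A]

object ContextBoundDemo {
  // `def f[A: Encoder]` is sugar for an extra implicit parameter list:
  def testTransformer[A: Encoder](label: String): String = label
  def testTransformerDesugared[A](label: String)(implicit ev: Encoder[A]): String = label

  def main(args: Array[String]): Unit = {
    implicit val intEnc: Encoder[Int] = new Encoder[Int] {}
    // Resolved implicitly from scope...
    println(testTransformer[Int]("implicit"))
    // ...or supplied explicitly, which is what `}(t.encoder)` does in the diff,
    // because the encoder is only known as a runtime value in the loop.
    println(testTransformerDesugared[Int]("explicit")(intEnc))
  }
}
```

Passing the encoder explicitly is the standard escape hatch when the implicit cannot be resolved statically, e.g. when iterating over a runtime `Seq` of type/encoder pairs as the test does.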
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171519201 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -116,11 +110,12 @@ class OneHotEncoderEstimatorSuite .setInputCols(Array("index")) .setOutputCols(Array("encoded")) val model = encoder.fit(df) -val output = model.transform(df) -val group = AttributeGroup.fromStructField(output.schema("encoded")) -assert(group.size === 2) -assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("0").withIndex(0)) -assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("1").withIndex(1)) +testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { rows => + val group = AttributeGroup.fromStructField(rows.head.schema("encoded")) + assert(group.size === 2) + assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("0").withIndex(0)) + assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("1").withIndex(1)) +} --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171546765 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -364,18 +419,47 @@ class QuantileDiscretizerSuite .setOutputCols(Array("result1", "result2", "result3")) .setNumBucketsArray(Array(10, 10, 10)) -val result1 = discretizerSingleNumBuckets.fit(df).transform(df) - .select("result1", "result2", "result3") - .collect() -val result2 = discretizerNumBucketsArray.fit(df).transform(df) - .select("result1", "result2", "result3") +val expected = Seq( + (0.0, 0.0, 0.0), + (1.0, 1.0, 1.0), + (1.0, 1.0, 1.0), + (2.0, 2.0, 2.0), + (2.0, 2.0, 2.0), + (2.0, 2.0, 2.0), + (3.0, 3.0, 3.0), + (4.0, 4.0, 4.0), + (4.0, 4.0, 4.0), + (5.0, 5.0, 5.0), + (5.0, 5.0, 5.0), + (5.0, 5.0, 5.0), + (6.0, 6.0, 6.0), + (6.0, 6.0, 6.0), + (7.0, 7.0, 7.0), + (8.0, 8.0, 8.0), + (8.0, 8.0, 8.0), + (9.0, 9.0, 9.0), + (9.0, 9.0, 9.0), + (9.0, 9.0, 9.0) +).toDF("result1", "result2", "result3") --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171546691 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -324,19 +352,46 @@ class QuantileDiscretizerSuite .setStages(Array(discretizerForCol1, discretizerForCol2, discretizerForCol3)) .fit(df) -val resultForMultiCols = plForMultiCols.transform(df) - .select("result1", "result2", "result3") - .collect() - -val resultForSingleCol = plForSingleCol.transform(df) - .select("result1", "result2", "result3") - .collect() +val expected = Seq( + (0.0, 0.0, 0.0), + (0.0, 0.0, 1.0), + (0.0, 0.0, 1.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 2.0), + (0.0, 1.0, 3.0), + (0.0, 2.0, 4.0), + (0.0, 2.0, 4.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 2.0, 5.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 6.0), + (1.0, 3.0, 7.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 8.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0), + (1.0, 4.0, 9.0) + ).toDF("result1", "result2", "result3") +.collect().toSeq --- End diff -- What about use: ``` val expected = plForSingleCol.transform(df).select("result1", "result2", "result3").collect() ``` So that avoid hardcoding the big array. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171517362 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala --- @@ -17,94 +17,72 @@ package org.apache.spark.ml.feature -import org.apache.spark.SparkFunSuite import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} import org.apache.spark.ml.util.TestingUtils._ -import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Row} -class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class NormalizerSuite extends MLTest with DefaultReadWriteTest { import testImplicits._ - @transient var data: Array[Vector] = _ - @transient var dataFrame: DataFrame = _ - @transient var normalizer: Normalizer = _ - @transient var l1Normalized: Array[Vector] = _ - @transient var l2Normalized: Array[Vector] = _ + @transient val data: Seq[Vector] = Seq( +Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), +Vectors.dense(0.0, 0.0, 0.0), +Vectors.dense(0.6, -1.1, -3.0), +Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), +Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), +Vectors.sparse(3, Seq())) --- End diff -- I prefer to put initializing `data` var into `beforeAll`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171543393 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite val df = spark.createDataFrame(sc.parallelize(data), schema) -val dfWithTypes = df - .withColumn("shortInput", df("input").cast(ShortType)) - .withColumn("longInput", df("input").cast(LongType)) - .withColumn("intInput", df("input").cast(IntegerType)) - .withColumn("floatInput", df("input").cast(FloatType)) - .withColumn("decimalInput", df("input").cast(DecimalType(10, 0))) - -val cols = Array("input", "shortInput", "longInput", "intInput", - "floatInput", "decimalInput") -for (col <- cols) { - val encoder = new OneHotEncoderEstimator() -.setInputCols(Array(col)) +class NumericTypeWithEncoder[A](val numericType: NumericType) + (implicit val encoder: Encoder[(A, Vector)]) + +val types = Seq( + new NumericTypeWithEncoder[Short](ShortType), + new NumericTypeWithEncoder[Long](LongType), + new NumericTypeWithEncoder[Int](IntegerType), + new NumericTypeWithEncoder[Float](FloatType), + new NumericTypeWithEncoder[Byte](ByteType), + new NumericTypeWithEncoder[Double](DoubleType), + new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder())) + +for (t <- types) { + val dfWithTypes = df.select(col("input").cast(t.numericType), col("expected")) + val estimator = new OneHotEncoderEstimator() +.setInputCols(Array("input")) .setOutputCols(Array("output")) .setDropLast(false) - val model = encoder.fit(dfWithTypes) - val encoded = model.transform(dfWithTypes) - - encoded.select("output", "expected").rdd.map { r => -(r.getAs[Vector](0), r.getAs[Vector](1)) - }.collect().foreach { case (vec1, vec2) => -assert(vec1 === vec2) - } + val model = estimator.fit(dfWithTypes) + testTransformer(dfWithTypes, model, "output", "expected") { +case Row(output: Vector, expected: Vector) => + assert(output === expected) + }(t.encoder) --- End diff -- Why is `(t.encoder)` needed here?
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171542822 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -151,29 +146,30 @@ class OneHotEncoderEstimatorSuite val df = spark.createDataFrame(sc.parallelize(data), schema) -val dfWithTypes = df - .withColumn("shortInput", df("input").cast(ShortType)) - .withColumn("longInput", df("input").cast(LongType)) - .withColumn("intInput", df("input").cast(IntegerType)) - .withColumn("floatInput", df("input").cast(FloatType)) - .withColumn("decimalInput", df("input").cast(DecimalType(10, 0))) - -val cols = Array("input", "shortInput", "longInput", "intInput", - "floatInput", "decimalInput") -for (col <- cols) { - val encoder = new OneHotEncoderEstimator() -.setInputCols(Array(col)) +class NumericTypeWithEncoder[A](val numericType: NumericType) + (implicit val encoder: Encoder[(A, Vector)]) + +val types = Seq( + new NumericTypeWithEncoder[Short](ShortType), + new NumericTypeWithEncoder[Long](LongType), + new NumericTypeWithEncoder[Int](IntegerType), + new NumericTypeWithEncoder[Float](FloatType), + new NumericTypeWithEncoder[Byte](ByteType), + new NumericTypeWithEncoder[Double](DoubleType), + new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder())) --- End diff -- Why not use `Seq(ShortType, LongType, ...)` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171519138 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -103,11 +96,12 @@ class OneHotEncoderEstimatorSuite .setInputCols(Array("size")) .setOutputCols(Array("encoded")) val model = encoder.fit(df) -val output = model.transform(df) -val group = AttributeGroup.fromStructField(output.schema("encoded")) -assert(group.size === 2) -assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) -assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +testTransformerByGlobalCheckFunc[(Double)](df, model, "encoded") { rows => +val group = AttributeGroup.fromStructField(rows.head.schema("encoded")) +assert(group.size === 2) +assert(group.getAttr(0) === BinaryAttribute.defaultAttr.withName("small").withIndex(0)) +assert(group.getAttr(1) === BinaryAttribute.defaultAttr.withName("medium").withIndex(1)) +} --- End diff -- I think for streaming , we don't need to test functions about attributes, so this part just keep old testing code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20686#discussion_r171545178

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala ---
@@ -119,29 +131,41 @@ class OneHotEncoderSuite
   test("OneHotEncoder with varying types") {
     val df = stringIndexed()
-    val dfWithTypes = df
-      .withColumn("shortLabel", df("labelIndex").cast(ShortType))
-      .withColumn("longLabel", df("labelIndex").cast(LongType))
-      .withColumn("intLabel", df("labelIndex").cast(IntegerType))
-      .withColumn("floatLabel", df("labelIndex").cast(FloatType))
-      .withColumn("decimalLabel", df("labelIndex").cast(DecimalType(10, 0)))
-    val cols = Array("labelIndex", "shortLabel", "longLabel", "intLabel",
-      "floatLabel", "decimalLabel")
-    for (col <- cols) {
+    val attr = NominalAttribute.defaultAttr.withValues("small", "medium", "large")
+    val expected = Seq(
+      (0, Vectors.sparse(3, Seq((0, 1.0)))),
+      (1, Vectors.sparse(3, Seq((2, 1.0)))),
+      (2, Vectors.sparse(3, Seq((1, 1.0)))),
+      (3, Vectors.sparse(3, Seq((0, 1.0)))),
+      (4, Vectors.sparse(3, Seq((0, 1.0)))),
+      (5, Vectors.sparse(3, Seq((1, 1.0))))).toDF("id", "expected")
+
+    val withExpected = df.join(expected, "id")
+
+    class NumericTypeWithEncoder[A](val numericType: NumericType)
+        (implicit val encoder: Encoder[(A, Vector)])
+
+    val types = Seq(
+      new NumericTypeWithEncoder[Short](ShortType),
+      new NumericTypeWithEncoder[Long](LongType),
+      new NumericTypeWithEncoder[Int](IntegerType),
+      new NumericTypeWithEncoder[Float](FloatType),
+      new NumericTypeWithEncoder[Byte](ByteType),
+      new NumericTypeWithEncoder[Double](DoubleType),
+      new NumericTypeWithEncoder[Decimal](DecimalType(10, 0))(ExpressionEncoder()))
--- End diff --

ditto.
[GitHub] spark pull request #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml...
GitHub user attilapiros opened a pull request: https://github.com/apache/spark/pull/20686

[SPARK-22915][MLlib] Streaming tests for spark.ml.feature, from N to Z

# What changes were proposed in this pull request?

Adds structured streaming tests using testTransformer for these suites:

- NGramSuite.scala
- NormalizerSuite.scala
- OneHotEncoderEstimatorSuite.scala
- OneHotEncoderSuite.scala
- PCASuite.scala
- PolynomialExpansionSuite.scala
- QuantileDiscretizerSuite.scala
- RFormulaSuite.scala
- SQLTransformerSuite.scala
- StandardScalerSuite.scala
- StopWordsRemoverSuite.scala
- StringIndexerSuite.scala
- TokenizerSuite.scala
- VectorIndexerSuite.scala
- VectorSizeHintSuite.scala
- VectorSlicerSuite.scala
- Word2VecSuite.scala

# How was this patch tested?

They are unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/attilapiros/spark SPARK-22915

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20686.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20686

commit 4099c85e94fabdc4848acf54a9d2704d4f3f5246
Author: attilapiros
Date: 2018-02-23T17:27:46Z

    initial upload