[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r132021227

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
```
@@ -107,9 +103,9 @@ class MultilayerPerceptronClassifierSuite
     model.setProbabilityCol("probability")
     MLTestingUtils.checkCopyAndUids(trainer, model)
     // result.select("probability").show(false)
```
--- End diff --

The result is:

```
+--+
|probability                             |
+--+
|[0.995713441748,4.2865582522823835E-7]  |
|[1.992910055147819E-9,0.80070899]       |
|[0.4999458983233704,0.5000541016766296] |
|[0.4999458983233704,0.5000541016766296] |
+--+
```

cc @MrBago Thanks!

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r131992917

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
```
@@ -83,6 +83,36 @@ class MultilayerPerceptronClassifierSuite
     }
   }

+  test("strong dataset test") {
+    val layers = Array[Int](4, 5, 5, 4)
+
+    val rnd = new scala.util.Random(1234L)
+
+    val strongDataset = Seq.tabulate(4) { index =>
+      (Vectors.dense(
+        rnd.nextGaussian(),
+        rnd.nextGaussian() * 2.0,
+        rnd.nextGaussian() * 3.0,
+        rnd.nextGaussian() * 2.0
+      ), (index % 4).toDouble)
+    }.toDF("features", "label")
+    val trainer = new MultilayerPerceptronClassifier()
+      .setLayers(layers)
+      .setBlockSize(1)
+      .setSeed(123L)
+      .setMaxIter(100)
+      .setSolver("l-bfgs")
+    val model = trainer.fit(strongDataset)
+    val result = model.transform(strongDataset)
+    model.setProbabilityCol("probability")
+    MLTestingUtils.checkCopyAndUids(trainer, model)
+    // result.select("probability").show(false)
+    val predictionAndLabels = result.select("prediction", "label").collect()
+    predictionAndLabels.foreach { case Row(p: Double, l: Double) =>
+      assert(p == l)
+    }
+  }
```
--- End diff --

@MrBago What do you think of this test? The probabilities it generates are:

```
+--+
|probability                                                                           |
+--+
|[0.9917274999513315,0.001511626318489583,0.004831796668307991,0.0019290770618710876] |
|[4.2392735713619E-12,0.99955336,1.8369996605279208E-14,2.0871629225077174E-13]       |
|[1.8975708749716946E-4,5.191732707447977E-22,0.5010860788259045,0.49872416408659836] |
|[1.6776134471360903E-4,3.9309610969078615E-22,0.49629577580941386,0.5035364628458726]|
+--+
```

It contains some values near 0.5.
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r131824713

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
```
@@ -82,6 +83,23 @@ class MultilayerPerceptronClassifierSuite
     }
   }

+  test("test model probability") {
+    val layers = Array[Int](2, 5, 2)
+    val trainer = new MultilayerPerceptronClassifier()
+      .setLayers(layers)
+      .setBlockSize(1)
+      .setSeed(123L)
+      .setMaxIter(100)
+      .setSolver("l-bfgs")
+    val model = trainer.fit(dataset)
+    model.setProbabilityCol("probability")
+    val result = model.transform(dataset)
+    val features2prob = udf { features: Vector => model.mlpModel.predict(features) }
+    val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 }
+    assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
+      .rdd.map(_.getBoolean(0)).reduce(_ && _))
+  }
+
```
--- End diff --

@MrBago How should the stronger test be done? Add a test to check that the probability vectors equal given vectors?
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r131790673

--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
```
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
```
--- End diff --

@MrBago In `MultilayerPerceptronClassifier.train` there is a line:
```
val topology = FeedForwardTopology.multiLayerPerceptron(myLayers, softmaxOnTop = true)
```
So `MultilayerPerceptronClassifier` always uses softmax as the last layer.
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r130746928

--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
```
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
```
--- End diff --

Could you add the above comment in the code? It could be useful for folks reading/editing this in the future. Also, it seems like the last layer could also be a `SigmoidLayerWithSquaredError` or a `SigmoidFunction`; do we need to handle those cases any differently?
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r130747030

--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
```
@@ -363,7 +363,7 @@ private[ann] trait TopologyModel extends Serializable {
    * @param data input data
    * @return array of outputs for each of the layers
    */
-  def forward(data: BDM[Double]): Array[BDM[Double]]
+  def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]]
```
--- End diff --

Can you update the docstring for this method to add the argument?
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r130746665

--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
```
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
```
--- End diff --

Could we use a variable name like `includeLastLayer` here? `containsLastLayer` sounds like a property of the model instead of an instruction to the method.
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r130747996

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
```
@@ -82,6 +83,23 @@ class MultilayerPerceptronClassifierSuite
     }
   }

+  test("test model probability") {
+    val layers = Array[Int](2, 5, 2)
+    val trainer = new MultilayerPerceptronClassifier()
+      .setLayers(layers)
+      .setBlockSize(1)
+      .setSeed(123L)
+      .setMaxIter(100)
+      .setSolver("l-bfgs")
+    val model = trainer.fit(dataset)
+    model.setProbabilityCol("probability")
+    val result = model.transform(dataset)
+    val features2prob = udf { features: Vector => model.mlpModel.predict(features) }
+    val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 }
+    assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
+      .rdd.map(_.getBoolean(0)).reduce(_ && _))
+  }
+
```
--- End diff --

I think we should include a stronger test for this. I did a quick search and couldn't find a strong test for `mlpModel.predict`; it might be good to add one. Also, I believe this xor dataset only produces probability predictions approximately equal to 0 or 1.
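The point above about the xor dataset can be illustrated with a small standalone sketch (plain Scala, not Spark code; all names here are illustrative): once training separates the raw scores, the softmax output saturates toward 0 and 1, so a perfectly separable dataset cannot exercise intermediate probability values.

```scala
// Illustrative only (not Spark's internal code): softmax saturates once
// the raw scores are driven far apart, which is why the xor dataset makes
// a weak test for the probability column.
object SoftmaxSaturation {
  def softmax(raw: Array[Double]): Array[Double] = {
    val m = raw.max                          // shift by max for numerical stability
    val e = raw.map(v => math.exp(v - m))
    val s = e.sum
    e.map(_ / s)
  }

  def main(args: Array[String]): Unit = {
    val close = softmax(Array(0.1, -0.1))    // near-equal scores -> probs near 0.5
    val far = softmax(Array(10.0, -10.0))    // separated scores -> probs near 1 and 0
    assert(close.forall(p => p > 0.4 && p < 0.6))
    assert(far(0) > 0.999 && far(1) < 0.001)
    println(close.mkString(",") + " | " + far.mkString(","))
  }
}
```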
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r129697649

--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
```
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
```
--- End diff --

The last layer is always `softmax`. This adds the `containsLastLayer` parameter: when `true`, the forward computation includes the last layer; otherwise it does not. The parameter is used when we need `rawPrediction`, where the last `softmax` layer should be discarded.
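The semantics described above can be sketched with a toy model (plain Scala; `affine` and the flag name are illustrative placeholders, not Spark's private API): forwarding with the last layer yields the probability output, forwarding without it yields the raw scores, and applying softmax to the raw scores recovers the probabilities.

```scala
// Toy sketch of the forward-with/without-last-layer invariant.
// Not Spark code: `affine` stands in for the hidden layers, softmax for the
// final layer that forward() can include or skip.
object ForwardSketch {
  def affine(x: Array[Double]): Array[Double] = x.map(v => 2.0 * v + 1.0)

  def softmax(raw: Array[Double]): Array[Double] = {
    val m = raw.max                          // shift by max for numerical stability
    val e = raw.map(v => math.exp(v - m))
    val s = e.sum
    e.map(_ / s)
  }

  def forward(x: Array[Double], includeLastLayer: Boolean): Array[Double] = {
    val raw = affine(x)                      // output of all layers but the last
    if (includeLastLayer) softmax(raw) else raw
  }

  def main(args: Array[String]): Unit = {
    val x = Array(0.3, -1.2)
    val probs = forward(x, includeLastLayer = true)    // like predict
    val raw = forward(x, includeLastLayer = false)     // like predictRaw
    // softmax(raw) reproduces the probability output
    assert(probs.zip(softmax(raw)).forall { case (a, b) => math.abs(a - b) < 1e-12 })
    println(probs.mkString(","))
  }
}
```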
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r129697890

--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
```
@@ -527,9 +544,21 @@ private[ml] class FeedForwardModel private(
   override def predict(data: Vector): Vector = {
     val size = data.size
-    val result = forward(new BDM[Double](size, 1, data.toArray))
+    val result = forward(new BDM[Double](size, 1, data.toArray), true)
     Vectors.dense(result.last.toArray)
   }
+
+  override def predictRaw(data: Vector): Vector = {
+    val size = data.size
```
--- End diff --

This adds a `predictRaw` method that computes the forward pass without the last layer (softmax).
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r129698281

--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
```
@@ -527,9 +544,21 @@ private[ml] class FeedForwardModel private(
   override def predict(data: Vector): Vector = {
     val size = data.size
-    val result = forward(new BDM[Double](size, 1, data.toArray))
+    val result = forward(new BDM[Double](size, 1, data.toArray), true)
     Vectors.dense(result.last.toArray)
   }
+
+  override def predictRaw(data: Vector): Vector = {
+    val size = data.size
+    val result = forward(new BDM[Double](size, 1, data.toArray), false)
+    Vectors.dense(result(result.length - 2).toArray)
+  }
+
+  override def raw2ProbabilityInPlace(data: Vector): Vector = {
+    val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
```
--- End diff --

This adds `raw2ProbabilityInPlace`. What it computes is:
```
softmax(rawPredictionVector) ==> probabilityVector
```
It directly calls the last layer's function to compute it.
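As a hedged sketch of what the computation above does (plain Scala over an array rather than Spark's `Vector`/Breeze types; the names are illustrative, not Spark's private API): the raw-prediction values are overwritten in place with their softmax, yielding a probability vector that sums to 1.

```scala
// Illustrative sketch of raw2ProbabilityInPlace: apply softmax to the
// raw-prediction values, overwriting them in place. Not Spark's actual
// implementation, which delegates to the topology's last (softmax) layer.
object Raw2ProbSketch {
  def raw2ProbabilityInPlace(data: Array[Double]): Array[Double] = {
    val m = data.max                         // shift by max for numerical stability
    var s = 0.0
    var i = 0
    while (i < data.length) {                // exponentiate in place, accumulate sum
      data(i) = math.exp(data(i) - m)
      s += data(i)
      i += 1
    }
    i = 0
    while (i < data.length) {                // normalize in place
      data(i) /= s
      i += 1
    }
    data
  }

  def main(args: Array[String]): Unit = {
    val v = Array(1.0, 2.0, 3.0)
    raw2ProbabilityInPlace(v)
    assert(math.abs(v.sum - 1.0) < 1e-12)    // a valid probability vector
    assert(v(2) > v(1) && v(1) > v(0))       // ordering of raw scores preserved
    println(v.mkString(","))
  }
}
```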
[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/17373

[SPARK-12664] Expose probability in mlp model

## What changes were proposed in this pull request?

Modify the MLP model to inherit `ProbabilisticClassificationModel` so that it can expose the probability column when transforming data.

## How was this patch tested?

Test added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark expose_probability_in_mlp_model

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17373.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17373

commit 1f0da4e8ebff8509ac3bc6f06004cbecff6356e9
Author: WeichenXu
Date: 2017-03-20T14:31:14Z

    expose probability in mlp model