[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r132021227
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
@@ -107,9 +103,9 @@ class MultilayerPerceptronClassifierSuite
     model.setProbabilityCol("probability")
     MLTestingUtils.checkCopyAndUids(trainer, model)
     // result.select("probability").show(false)
--- End diff --

The result is
```
+---------------------------------------+
|probability                            |
+---------------------------------------+
|[0.995713441748,4.2865582522823835E-7] |
|[1.992910055147819E-9,0.80070899]      |
|[0.4999458983233704,0.5000541016766296]|
|[0.4999458983233704,0.5000541016766296]|
+---------------------------------------+
```
cc @MrBago Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r131992917
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
@@ -83,6 +83,36 @@ class MultilayerPerceptronClassifierSuite
 }
   }
 
+  test("strong dataset test") {
+    val layers = Array[Int](4, 5, 5, 4)
+
+    val rnd = new scala.util.Random(1234L)
+
+    val strongDataset = Seq.tabulate(4) { index =>
+      (Vectors.dense(
+        rnd.nextGaussian(),
+        rnd.nextGaussian() * 2.0,
+        rnd.nextGaussian() * 3.0,
+        rnd.nextGaussian() * 2.0
+      ), (index % 4).toDouble)
+    }.toDF("features", "label")
+    val trainer = new MultilayerPerceptronClassifier()
+      .setLayers(layers)
+      .setBlockSize(1)
+      .setSeed(123L)
+      .setMaxIter(100)
+      .setSolver("l-bfgs")
+    val model = trainer.fit(strongDataset)
+    val result = model.transform(strongDataset)
+    model.setProbabilityCol("probability")
+    MLTestingUtils.checkCopyAndUids(trainer, model)
+    // result.select("probability").show(false)
+    val predictionAndLabels = result.select("prediction", "label").collect()
+    predictionAndLabels.foreach { case Row(p: Double, l: Double) =>
+      assert(p == l)
+    }
+  }
--- End diff --

@MrBago 
What do you think of this test? The probabilities it generates are:


```
+-------------------------------------------------------------------------------------+
|probability                                                                          |
+-------------------------------------------------------------------------------------+
|[0.9917274999513315,0.001511626318489583,0.004831796668307991,0.0019290770618710876] |
|[4.2392735713619E-12,0.99955336,1.8369996605279208E-14,2.0871629225077174E-13]       |
|[1.8975708749716946E-4,5.191732707447977E-22,0.5010860788259045,0.49872416408659836] |
|[1.6776134471360903E-4,3.9309610969078615E-22,0.49629577580941386,0.5035364628458726]|
+-------------------------------------------------------------------------------------+
```

It contains some values near 0.5.






[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r131824713
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
@@ -82,6 +83,23 @@ class MultilayerPerceptronClassifierSuite
 }
   }
 
+  test("test model probability") {
+    val layers = Array[Int](2, 5, 2)
+    val trainer = new MultilayerPerceptronClassifier()
+      .setLayers(layers)
+      .setBlockSize(1)
+      .setSeed(123L)
+      .setMaxIter(100)
+      .setSolver("l-bfgs")
+    val model = trainer.fit(dataset)
+    model.setProbabilityCol("probability")
+    val result = model.transform(dataset)
+    val features2prob = udf { features: Vector => model.mlpModel.predict(features) }
+    val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 }
+    assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
+      .rdd.map(_.getBoolean(0)).reduce(_ && _))
+  }
+
--- End diff --

@MrBago 
Which form should the stronger test take? Should I add a test that checks the probability vectors against given expected vectors?





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r131790673
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null
 
-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
--- End diff --

@MrBago In `MultilayerPerceptronClassifier.train` there is the line:
```
val topology = FeedForwardTopology.multiLayerPerceptron(myLayers, softmaxOnTop = true)
```
So `MultilayerPerceptronClassifier` always uses softmax as the last layer.





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130746928
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null
 
-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
--- End diff --

Could you add the above comment in the code? It could be useful for folks 
reading/editing this in the future.

Also, it seems like the last layer could also be a 
`SigmoidLayerWithSquaredError` or a `SigmoidFunction`; do we need to handle 
those cases any differently?





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130747030
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -363,7 +363,7 @@ private[ann] trait TopologyModel extends Serializable {
* @param data input data
* @return array of outputs for each of the layers
*/
-  def forward(data: BDM[Double]): Array[BDM[Double]]
+  def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]]
--- End diff --

Can you update the docstring for this method to add the argument?





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130746665
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null
 
-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
--- End diff --

Could we use a variable name like `includeLastLayer` here? 
`containsLastLayer` sounds like a property of the model rather than an 
instruction to the method.





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130747996
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
@@ -82,6 +83,23 @@ class MultilayerPerceptronClassifierSuite
 }
   }
 
+  test("test model probability") {
+    val layers = Array[Int](2, 5, 2)
+    val trainer = new MultilayerPerceptronClassifier()
+      .setLayers(layers)
+      .setBlockSize(1)
+      .setSeed(123L)
+      .setMaxIter(100)
+      .setSolver("l-bfgs")
+    val model = trainer.fit(dataset)
+    model.setProbabilityCol("probability")
+    val result = model.transform(dataset)
+    val features2prob = udf { features: Vector => model.mlpModel.predict(features) }
+    val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 }
+    assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
+      .rdd.map(_.getBoolean(0)).reduce(_ && _))
+  }
+
--- End diff --

I think we should include a stronger test for this. I did a quick search 
and couldn't find a strong test for `mlpModel.predict`; it might be good to add 
one. Also, I believe this XOR dataset only produces probability predictions 
approximately equal to 0 or 1.







[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-07-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r129697649
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null
 
-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
--- End diff --

The last layer is always `softmax`. This adds the `containsLastLayer` parameter: 
when `true`, the forward computation includes the last layer; otherwise it does 
not. The parameter is used when we need `rawPrediction`, where the last 
`softmax` layer should be discarded.
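The idea can be sketched outside of Spark. Below is a hypothetical minimal forward pass in plain Python (not the mllib implementation; `affine`, `softmax`, and the `include_last_layer` flag are illustrative names), where the flag decides whether the final softmax is applied:

```python
import math

def affine(x, w, b):
    # One dense layer: w is a list of weight rows, b the bias vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def softmax(z):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x, layers, include_last_layer):
    # `layers` is a list of (weights, bias) pairs; the topology always ends
    # with a softmax, mirroring softmaxOnTop = true.
    out = x
    for w, b in layers:
        out = affine(out, w, b)
    # With the flag off we stop before the softmax and return the raw
    # scores, which is what rawPrediction needs.
    return softmax(out) if include_last_layer else out

identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
raw = forward([1.0, 2.0], [identity], include_last_layer=False)
prob = forward([1.0, 2.0], [identity], include_last_layer=True)
```

Here `raw` keeps the pre-softmax scores while `prob` is normalized to sum to 1; the real implementation instead chooses how many layers of the stack to run.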





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-07-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r129697890
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -527,9 +544,21 @@ private[ml] class FeedForwardModel private(
 
   override def predict(data: Vector): Vector = {
     val size = data.size
-    val result = forward(new BDM[Double](size, 1, data.toArray))
+    val result = forward(new BDM[Double](size, 1, data.toArray), true)
     Vectors.dense(result.last.toArray)
   }
+
+  override def predictRaw(data: Vector): Vector = {
+    val size = data.size
--- End diff --

This adds the `predictRaw` method, which computes the forward pass without the last (softmax) layer.





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-07-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r129698281
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -527,9 +544,21 @@ private[ml] class FeedForwardModel private(
 
   override def predict(data: Vector): Vector = {
     val size = data.size
-    val result = forward(new BDM[Double](size, 1, data.toArray))
+    val result = forward(new BDM[Double](size, 1, data.toArray), true)
     Vectors.dense(result.last.toArray)
   }
+
+  override def predictRaw(data: Vector): Vector = {
+    val size = data.size
+    val result = forward(new BDM[Double](size, 1, data.toArray), false)
+    Vectors.dense(result(result.length - 2).toArray)
+  }
+
+  override def raw2ProbabilityInPlace(data: Vector): Vector = {
+    val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
--- End diff --

This adds `raw2ProbabilityInPlace`, which computes:
```
softmax(rawPredictionVector) ==> probabilityVector
```
It directly calls the last layer's function to compute it.
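As a small numeric illustration of that mapping (plain Python, not the Layer.scala code; `raw2probability` is an illustrative name): applying softmax to a raw-prediction vector yields a vector that sums to 1 and preserves the argmax, so the class predicted from the probabilities matches the one predicted from the raw scores:

```python
import math

def raw2probability(raw):
    # softmax(rawPredictionVector) -> probabilityVector
    m = max(raw)  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in raw]
    s = sum(exps)
    return [e / s for e in exps]

raw = [2.0, -1.0, 0.5]
prob = raw2probability(raw)
# prob sums to 1, every entry is in (0, 1), and the largest raw score
# stays the largest probability.
```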





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-03-21 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/17373

[SPARK-12664] Expose probability in mlp model

## What changes were proposed in this pull request?

Modify the MLP model to inherit from `ProbabilisticClassificationModel` so that 
it can expose the probability column when transforming data.

## How was this patch tested?

Test added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark expose_probability_in_mlp_model

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17373.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17373


commit 1f0da4e8ebff8509ac3bc6f06004cbecff6356e9
Author: WeichenXu 
Date:   2017-03-20T14:31:14Z

expose probability in mlp model



