[GitHub] spark pull request #17407: [SPARK-20043][ML][WIP] DecisionTreeModel can't re...

2017-03-23 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/17407

[SPARK-20043][ML][WIP] DecisionTreeModel can't recognize Impurity "Gini" 
when loading

Fix bug: DecisionTreeModel can't recognize Impurity "Gini" when loading

TODO:
+ [ ] add unit test
+ [ ] fix the bug


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark BUG/decision_tree_loader_failer_with_Gini_impurity

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17407


commit 2f5477f525319fc323cb63e42337c876592788c1
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-24T04:18:39Z

TST: add testsuite







[GitHub] spark issue #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator ...

2017-03-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17407
  
@jkbradley Hi, tests passed. Is it good enough to be merged?

By the way, String Params are fragile when saving/loading a model, since the 
setParam and getParam methods (and their validation) are bypassed in that case.
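A minimal standalone sketch of that fragility (the names and the exact-match
behavior here are assumptions for illustration, not Spark's actual internals):
a value the setter accepted case-insensitively can fail an exact string match
at load time.
```scala
// Hypothetical reproduction: the setter accepts "Gini", the writer persists
// it verbatim, but a loader that matches exactly only knows the lowercase
// names -- so reading the model back throws.
val persisted = "Gini"
val knownToLoader = Set("gini", "entropy", "variance")
require(knownToLoader.contains(persisted),  // fails for "Gini"
  s"ImpurityCalculator builder did not recognize impurity type: $persisted")
```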





[GitHub] spark issue #17407: [SPARK-20043][ML] DecisionTreeModel can't recognize Imp...

2017-03-25 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17407
  
@jkbradley Thanks. I agree with your advice. Modifying the value is a 
little aggressive, while changing ImpurityCalculator.getCalculator is more 
moderate. However, I'm afraid that similar bugs caused by the case mismatch 
won't go away if we don't fix the root cause first.

By the way, some unit tests are broken, though I had run them before 
pushing. Perhaps something is wrong with my IDE; I'll take a look next 
Monday.
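For context, a sketch of the "more moderate" fix discussed here, assuming the
builder simply normalizes the persisted string before matching (the shape
mirrors the error message in the tests; Spark's actual internals may differ):
```scala
import java.util.Locale

// Hypothetical builder: lowercase the persisted impurity name before
// matching, so "Gini" written at save time is accepted at load time.
def calculatorNameFor(impurity: String): String =
  impurity.toLowerCase(Locale.ROOT) match {
    case "gini"     => "GiniCalculator"
    case "entropy"  => "EntropyCalculator"
    case "variance" => "VarianceCalculator"
    case other => throw new IllegalArgumentException(
      s"ImpurityCalculator builder did not recognize impurity type: $other")
  }

calculatorNameFor("Gini")  // "GiniCalculator", instead of throwing
```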





[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-03-27 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17383#discussion_r108318833
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
@@ -301,7 +302,7 @@ private[tree] class LearningNode(
* group of nodes on one call to
* 
[[org.apache.spark.ml.tree.impl.RandomForest.findBestSplits()]].
*/
-  def predictImpl(binnedFeatures: Array[Int], splits: 
Array[Array[Split]]): Int = {
+  def predictImpl(binnedFeatures: Int => Int, splits: 
Array[Array[Split]]): Int = {
--- End diff --

When using `breeze.linalg.SparseVector`, compilation fails:
```scala
java.lang.AssertionError: assertion failed: List(method apply$mcI$sp, 
method apply$mcI$sp)
at 
scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1947)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2.matchingSymbolInPrefix$1(SpecializeTypes.scala:1474)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2.transformSelect$1(SpecializeTypes.scala:1497)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2.transform1(SpecializeTypes.scala:1593)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2$$anonfun$transform$3.apply(SpecializeTypes.scala:1442)
..
```

I don't know why. Can anyone help? 
Thanks.
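For readers following the diff, a small sketch of what the `Int => Int`
signature enables (toy values; the sparse accessor below is a plain stand-in,
not the breeze-backed one that triggers the crash above):
```scala
// The new signature abstracts binned features behind an accessor function,
// so dense and sparse representations can back predictImpl uniformly.
val dense: Array[Int] = Array(0, 2, 1)
val denseAccessor: Int => Int = dense(_)  // wrap array indexing

// a sparse representation only needs to expose the same Int => Int view
val sparseAccessor: Int => Int = i => if (i == 1) 2 else 0

// both can then be passed where binnedFeatures: Int => Int is expected
```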





[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-03-27 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17383#discussion_r108318997
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
@@ -301,7 +302,7 @@ private[tree] class LearningNode(
* group of nodes on one call to
* 
[[org.apache.spark.ml.tree.impl.RandomForest.findBestSplits()]].
*/
-  def predictImpl(binnedFeatures: Array[Int], splits: 
Array[Array[Split]]): Int = {
+  def predictImpl(binnedFeatures: Int => Int, splits: 
Array[Array[Split]]): Int = {
--- End diff --

When using `breeze.linalg.SparseVector`, compilation fails:
```scala
java.lang.AssertionError: assertion failed: List(method apply$mcI$sp, 
method apply$mcI$sp)
at 
scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1947)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2.matchingSymbolInPrefix$1(SpecializeTypes.scala:1474)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2.transformSelect$1(SpecializeTypes.scala:1497)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2.transform1(SpecializeTypes.scala:1593)
at 
scala.tools.nsc.transform.SpecializeTypes$$anon$2$$anonfun$transform$3.apply(SpecializeTypes.scala:1442)
..
```

I don't know why. Can anyone help? 
Thanks.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalc...

2017-03-27 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108320537
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite
 testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ 
Map("maxDepth" -> 0),
   allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SPARK-20043: " +
+   "ImpurityCalculator builder fails for uppercase impurity type Gini 
in model read/write") {
+val rdd = TreeTests.getTreeReadWriteData(sc)
+
+val data: DataFrame =
+  TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2)
+
+val dt = new DecisionTreeClassifier()
+  .setImpurity("Gini")
+  .setMaxDepth(2)
+
+val model = dt.fit(data)
+
+testDefaultReadWrite(model, false)
--- End diff --

done.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalc...

2017-03-27 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108320481
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala
 ---
@@ -178,6 +178,22 @@ class DecisionTreeRegressorSuite
   TreeTests.allParamSettings ++ Map("maxDepth" -> 0),
   TreeTests.allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SPARK-20043: " +
+   "ImpurityCalculator builder fails for uppercase impurity type in 
model read/write") {
+val rdd = TreeTests.getTreeReadWriteData(sc)
+
+val data: DataFrame =
+  TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 0)
+
+val dt = new DecisionTreeRegressor()
+  .setImpurity("Variance")
+  .setMaxDepth(2)
+
+val model = dt.fit(data)
+
--- End diff --

removed.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalc...

2017-03-27 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108320525
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite
 testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ 
Map("maxDepth" -> 0),
   allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SPARK-20043: " +
+   "ImpurityCalculator builder fails for uppercase impurity type Gini 
in model read/write") {
+val rdd = TreeTests.getTreeReadWriteData(sc)
+
+val data: DataFrame =
+  TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2)
+
+val dt = new DecisionTreeClassifier()
+  .setImpurity("Gini")
+  .setMaxDepth(2)
+
+val model = dt.fit(data)
+
--- End diff --

Deleted some blank lines to keep it compact.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recogn...

2017-03-25 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108050101
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
 testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ 
Map("maxDepth" -> 0),
   allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity 
type: Gini") {
+val rdd = TreeTests.getTreeReadWriteData(sc)
+
+val categoricalData: DataFrame =
+  TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
--- End diff --

done.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recogn...

2017-03-25 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108050103
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
 testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ 
Map("maxDepth" -> 0),
   allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity 
type: Gini") {
+val rdd = TreeTests.getTreeReadWriteData(sc)
+
+val categoricalData: DataFrame =
+  TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
+
+// BUG: see SPARK-20043
--- End diff --

removed the comment.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recogn...

2017-03-25 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108050099
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
 testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ 
Map("maxDepth" -> 0),
   allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity 
type: Gini") {
--- End diff --

done.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recogn...

2017-03-25 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108050110
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
 testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ 
Map("maxDepth" -> 0),
   allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity 
type: Gini") {
+val rdd = TreeTests.getTreeReadWriteData(sc)
+
+val categoricalData: DataFrame =
+  TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
+
+// BUG: see SPARK-20043
+val dt = new DecisionTreeClassifier().setImpurity("Gini")
--- End diff --

set maxDepth = 2.





[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recogn...

2017-03-25 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17407#discussion_r108050269
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite
 testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ 
Map("maxDepth" -> 0),
   allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SAPRK-20043: " +
--- End diff --

Oops, a copy & paste mistake. Thanks for the correction.





[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2017-03-20 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14547#discussion_r107077783
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
 ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Since}
+import org.apache.spark.mllib.tree.impurity._
+
+/**
+ * [[ApproxBernoulliImpurity]] currently uses variance as a (proxy) 
impurity measure
+ * during tree construction. The main purpose of the class is to have an 
alternative
+ * leaf prediction calculation.
+ *
+ * Only data with examples each of weight 1.0 is supported.
+ *
+ * Class for calculating variance during regression.
+ */
+@Since("2.1")
+private[spark] object ApproxBernoulliImpurity extends Impurity {
+
+  /**
+   * :: DeveloperApi ::
+   * information calculation for multiclass classification
+   * @param counts Array[Double] with counts for each label
+   * @param totalCount sum of counts for all labels
+   * @return information value, or 0 if totalCount = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(counts: Array[Double], totalCount: Double): 
Double =
+throw new 
UnsupportedOperationException("ApproxBernoulliImpurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * variance calculation
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value, or 0 if count = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(count: Double, sum: Double, sumSquares: Double): 
Double = {
+Variance.calculate(count, sum, sumSquares)
+  }
+}
+
+/**
+ * Class for updating views of a vector of sufficient statistics,
+ * in order to compute impurity from a sample.
+ * Note: Instances of this class do not hold the data; they operate on 
views of the data.
+ */
+private[spark] class ApproxBernoulliAggregator
+  extends ImpurityAggregator(statsSize = 4) with Serializable {
+
+  /**
+   * Update stats for one (node, feature, bin) with the given label.
+   * @param allStats  Flat stats array, with stats for this (node, 
feature, bin) contiguous.
+   * @param offsetStart index of stats for this (node, feature, bin).
+   */
+  def update(allStats: Array[Double], offset: Int, label: Double, 
instanceWeight: Double): Unit = {
+allStats(offset) += instanceWeight
+allStats(offset + 1) += instanceWeight * label
+allStats(offset + 2) += instanceWeight * label * label
+allStats(offset + 3) += instanceWeight * Math.abs(label)
+  }
+
+  /**
+   * Get an [[ImpurityCalculator]] for a (node, feature, bin).
+   * @param allStats  Flat stats array, with stats for this (node, 
feature, bin) contiguous.
+   * @param offsetStart index of stats for this (node, feature, bin).
+   */
+  def getCalculator(allStats: Array[Double], offset: Int): 
ApproxBernoulliCalculator = {
+new ApproxBernoulliCalculator(allStats.view(offset, offset + 
statsSize).toArray)
+  }
+}
+
+/**
+ * Stores statistics for one (node, feature, bin) for calculating impurity.
+ * Unlike [[ImpurityAggregator]], this class stores its own data and is 
for a specific
+ * (node, feature, bin).
+ * @param stats  Array of sufficient statistics for a (node, feature, bin).
+ */
+private[spark] class ApproxBernoulliCalculator(stats: Array[Double])
+  extends ImpurityCalculator(stats) {
+
+  require(stats.length == 4,
+s"ApproxBernoulliCalculator requires sufficient statistics array stats 
to be of length 4," +
+  s" but was given array of length ${stats.length}.")
+
+  /**
+

[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-03-22 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/17383

[SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data

## What changes were proposed in this pull request?

DecisionTree should take advantage of sparse feature vectors. Aggregation 
over training data could handle the empty/zero-valued data elements more 
efficiently.
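As a rough illustration of the idea (a sketch only, not this PR's
implementation): Spark's sparse vectors already expose iteration over just the
active entries, which is what aggregation over training data could exploit.
```scala
import org.apache.spark.ml.linalg.Vectors

// Toy example: only the non-zero entries are visited, so aggregation cost
// scales with the number of active entries rather than the vector size.
val binned = Vectors.sparse(10, Array(2, 7), Array(1.0, 3.0))
var sum = 0.0
binned.foreachActive { (_, value) => sum += value }  // visits 2 entries, not 10
```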


## How was this patch tested?

Modifying the inner implementation won't change the behavior of the DecisionTree
module, hence all existing unit tests should pass.

Some performance benchmarks are perhaps needed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17383


commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-22T05:45:58Z

CLN: use Vector to implement binnedFeatures in TreePoint







[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-04-05 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
@jkbradley @hhbyyh Could you review the PR? Thanks.





[GitHub] spark pull request #17503: [SPARK-3159][MLlib] Check for reducible DecisionT...

2017-04-01 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/17503

[SPARK-3159][MLlib] Check for reducible DecisionTree

Add a canMergeChildren param: find pairs of leaves under the same parent 
which output the same prediction, and merge them.

## How was this patch tested?

1. [x] add unit test: verify whether the implementation is correct.
2. [ ] add unit test: verify whether setCanMergeChildren works.
3. [ ] perhaps we need to create a sample which produces a reducible tree, 
and test it (see the sketch after this list).
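A minimal sketch of the merge rule, written against the public
`org.apache.spark.ml.tree` node classes (the predicate only; the PR itself
also rewrites the tree, and its internals may differ):
```scala
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// A parent whose two children are leaves with the same prediction is
// reducible: it could itself become a leaf with that prediction.
def hasMergeableChildren(node: Node): Boolean = node match {
  case n: InternalNode =>
    (n.leftChild, n.rightChild) match {
      case (l: LeafNode, r: LeafNode) => l.prediction == r.prediction
      case _ => false
    }
  case _ => false
}
```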


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark CLN/check_for_reducible_decision_tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17503.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17503


commit fab2a0e5a3c4db8beeaa78d98253d11e408f3b56
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-31T02:33:38Z

TST: create new test suite

commit f5d52cce500290165ac7d8bad5aa38041ed21c54
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-31T02:35:26Z

TST: helper method for construcing binary tree

commit b9248b7ae2e1048d93e44e2d3687c2f9fd286ce8
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-31T07:09:51Z

TST: helper method, show tree node info

commit be12f4f23a5fd53870bc97b6cbdb8fa0a094f2c1
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-31T07:21:41Z

TST: helper method, check if pairs of leave with same prediction exists

commit b52420201576610613c782146dc9d6c2dc6ebb0c
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-31T07:28:28Z

TST: helper method for modifying nodes

commit 98a73f952d1a199cf581cde2636d6dc831ae4ee3
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-31T07:41:46Z

ENH: merge the pairs of leave with same prediction of same parent

commit 632325d0e0d45d7fe9325686f90dbdc64b149960
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-01T01:10:07Z

ENH: add mergeLeave param in Strategy

commit 12052958d30d015be537fbd1169da4406869fb3d
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-01T01:18:50Z

ENH: support mergeChild when training

commit 434c762de76be2f1b4ec939ccba9c2ecb45c1c04
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-01T02:48:37Z

ENH: add canMergeChildren param in DecisionTreeParams

commit 5162552a8db92283a514e94adeef439f6fb8f80e
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-01T02:54:57Z

ENH: add set method in tree classifier

commit 21b1a851c89cd0a060720503bdc4a9441155236b
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-01T04:19:07Z

ENH: stat: merge counts of each tree

commit 25b712a37bbc31cc9b8ff2b6330d79fd437cb17c
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-01T06:16:01Z

BUG: depth=0 tree has none of children







[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-07 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r110385244
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
   // sort distinct values
   val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
 
+  def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
+val (preValue, preCount) = pre
+val (curValue, curCount) = cru
+(preValue * preCount + curValue * curCount) / (preCount + curCount)
--- End diff --

I agree with you. fixed.





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-07 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r110385132
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
   // sort distinct values
   val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
 
+  def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
--- End diff --

fixed.





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-07 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r110385526
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
   // sort distinct values
   val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
 
+  def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
+val (preValue, preCount) = pre
+val (curValue, curCount) = cru
+(preValue * preCount + curValue * curCount) / (preCount + curCount)
+  }
+
   // if possible splits is not enough or just enough, just return all 
possible splits
   val possibleSplits = valueCounts.length - 1
-  if (possibleSplits <= numSplits) {
-valueCounts.map(_._1).init
+  if (possibleSplits == 0) {
+// constant feature
+Array.empty[Double]
+
+  } else if (possibleSplits <= numSplits) {
+valueCounts
+  .sliding(2)
+  .map{x => weightedMean(x(0), x(1))}
--- End diff --

Fixed.
Do you mean using `scanLeft`? It seems a little complicated and obscure.
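For readers following along, a self-contained sketch of the sliding-pairs
computation in the diff above (toy data; `valueCounts` holds
(distinct value, count) pairs sorted by value):
```scala
// Toy (value, count) pairs, sorted by value.
val valueCounts = Array((1.0, 2), (2.0, 8), (3.0, 2))

// Weighted midpoint between each pair of adjacent distinct values.
val splits = valueCounts.sliding(2).map { case Array((v1, c1), (v2, c2)) =>
  (v1 * c1 + v2 * c2) / (c1 + c2)
}.toArray
// splits == Array(1.8, 2.2)
```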





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-07 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r110386358
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -996,7 +996,7 @@ private[spark] object RandomForest extends Logging {
 require(metadata.isContinuous(featureIndex),
   "findSplitsForContinuousFeature can only be used to find splits for 
a continuous feature.")
 
-val splits = if (featureSamples.isEmpty) {
+val splits: Array[Double] = if (featureSamples.isEmpty) {
--- End diff --

The code block is too long and has 4 exit points. Making the type explicit is 
perhaps easier to understand, even though the type of `splits` is implied by 
the return type.





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
Is there something wrong with Spark CI?





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
```
Test Result (1 failure / +1)

org.apache.spark.storage.TopologyAwareBlockReplicationPolicyBehavior.Peers in 2 racks
```

Does anyone know what this is?





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-10 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
@srowen Hi, I forgot the unit tests in Python and R. Where can I find 
documentation about setting up a development environment? Thanks.





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-12 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
I have run all the MLlib unit test cases in Python. However, I am not 
familiar with R, and I don't want to waste too much time on deploying an R 
environment.

Could CI retest the PR? We can then check whether any unit tests are still 
broken. Thanks.





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-14 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r111656240
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -126,9 +138,10 @@ class RandomForestSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Array(3), Gini, QuantileStrategy.Sort,
 0, 0, 0.0, 0, 0
   )
-  val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 
5).map(_.toDouble)
+  val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 3, 4, 5)
+.map(_.toDouble)
   val splits = 
RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
-  assert(splits === Array(2.0, 3.0))
+  assert(splits === Array(2.0625, 3.5))
--- End diff --

done.
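(For reference, with fifteen 2s and one each of 3, 4, and 5, the updated 
expectations are the weighted midpoints `(2 * 15 + 3 * 1) / 16 = 2.0625` and 
`(3 * 1 + 4 * 1) / 2 = 3.5`.)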





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-14 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r111656235
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -112,9 +124,9 @@ class RandomForestSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Array(5), Gini, QuantileStrategy.Sort,
 0, 0, 0.0, 0, 0
   )
-  val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 
3).map(_.toDouble)
+  val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 
3).map(_.toDouble)
   val splits = 
RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
-  assert(splits === Array(1.0, 2.0))
+  assert(splits === Array(1.8, 2.2))
--- End diff --

done.





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-14 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r111656245
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -104,6 +104,18 @@ class RandomForestSuite extends SparkFunSuite with 
MLlibTestSparkContext {
   assert(splits.distinct.length === splits.length)
 }
 
+// SPARK-16957: Use weighted midpoints for split values.
+{
+  val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
+Map(), Set(),
+Array(2), Gini, QuantileStrategy.Sort,
+0, 0, 0.0, 0, 0
+  )
+  val featureSamples = Array(0, 1, 0, 0, 1, 0, 1, 1).map(_.toDouble)
+  val splits = 
RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
+  assert(splits === Array(0.5))
--- End diff --

Added a new case.
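(For reference, with four 0s and four 1s the weighted midpoint is 
`(0 * 4 + 1 * 4) / 8 = 0.5`.)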





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-14 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
@sethah Perhaps it's hard to compare R's behavior with Spark's, since many 
factors are involved. I'd like to read R GBM's code and verify that both 
sides' designs for the split criterion are consistent. Is that OK?





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-13 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
many thanks, @srowen 





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-23 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
I scanned the split criterion of sklearn and xgboost.

1. sklearn
It considers all continuous values and splits at the midpoint of adjacent values.

commit 5147fd09c6a063188efde444f47bd006fa5f95f0
sklearn/tree/_splitter.pyx: 484:
```python
current.threshold = (Xf[p - 1] + Xf[p]) / 2.0
```

2. xgboost: 
commit 49bdb5c97fccd81b1fdf032eab4599a065c6c4f6

+ If all continuous values are used as candidates, it uses the midpoint.

   src/tree/updater_colmaker.cc: 555:
   ```c++
   e.best.Update(loss_chg, fid, (fvalue + e.last_fvalue) * 0.5f, d_step 
== -1);
   ```
+ If the continuous feature is quantized, it uses `cut`. I'm not familiar 
with C++ and update_histmaker.cc is a little complicated, so I don't know 
what `cut` is exactly. However, I guess it is the same as Spark's current 
split criterion.

   src/tree/updater_histmaker.cc: 194:
   ```c++
   if (best->Update(static_cast(loss_chg), fid, hist.cut[i], 
false)) {
   ```

Anyway, the weighted mean is more reasonable than the midpoint or cut value 
in my opinion. The PR is a trivial enhancement to the tree module, and given 
the fairly obvious conclusion it's not worth spending much time on.

However, we would be more confident with more feedback from experts.
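To make the comparison concrete, a small sketch of the two rules on the toy
data from the R gbm test below (illustrative only):
```scala
// Adjacent distinct values: 0.0 seen twice, 1.0 seen four times.
val (x, cx) = (0.0, 2)
val (y, cy) = (1.0, 4)

val midpoint = (x + y) / 2.0                   // 0.5: sklearn / R gbm style
val weighted = (cx * x + cy * y) / (cx + cy)   // 4.0 / 6 ~= 0.67: this PR
```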






[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-04-24 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
@srowen Hi, could you review the PR? The PR is simple, though a lot of unit 
test code has been added. Thanks.





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-22 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
Hi, I have checked R GBM's code and found that:
R's gbm uses the mean value $(x + y) / 2$, not the weighted mean 
$(c_x x + c_y y) / (c_x + c_y)$ described in [JIRA 
SPARK-16957](https://issues.apache.org/jira/browse/SPARK-16957), for the split 
point.

1. code snippet:
[gbm-developers/gbm](https://github.com/gbm-developers/gbm)
commit a1defa382a629f8b97bf9f552dcd821ee7ac9dac
src/node_search.cpp:145:
```c++
  else if(cCurrentVarClasses == 0)   // variable is continuous
  {
// Evaluate the current split
dCurrentSplitValue = 0.5*(dLastXValue + dX);
  }
```

2. test
To verify it, I created a toy dataset and ran a test in R.
```R
> f = c(0.0, 0.0, 1.0, 1.0, 1.0, 1.0)
> l = c(0,   0,   1,   1,   1,   1)
> df = data.frame(l, f)
> sapply(df, class)
l f
"numeric" "numeric"
> mod <- gbm(l~f, data=df, n.trees=1, bag.fraction=1, n.minobsinnode=1, 
distribution = "bernoulli")
> pretty.gbm.tree(mod)
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0        0      5.00e-01        1         2           3           1.33      6  1.480297e-19
1       -1     -3.00e-03       -1        -1          -1           0.00      2     -3.00e-03
2       -1      1.50e-03       -1        -1          -1           0.00      4      1.50e-03
3       -1  1.480297e-19       -1        -1          -1           0.00      6  1.480297e-19
```
As expected, the root's split point is 5.00e-01, namely the mean value 
`0.5 = (0 + 1) / 2`, not the weighted mean `(0 * 2 + 1 * 4) / 6 ≈ 0.67`.

3. conclusion
I prefer using the weighted mean for the split point in the PR, rather than 
the mean value used in R's gbm package. What do you think? @sethah @srowen


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2017-03-12 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14547#discussion_r105576961
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
 ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Since}
+import org.apache.spark.mllib.tree.impurity._
+
+/**
+ * [[ApproxBernoulliImpurity]] currently uses variance as a (proxy) 
impurity measure
+ * during tree construction. The main purpose of the class is to have an 
alternative
+ * leaf prediction calculation.
+ *
+ * Only data with examples each of weight 1.0 is supported.
+ *
+ * Class for calculating variance during regression.
+ */
+@Since("2.1")
+private[spark] object ApproxBernoulliImpurity extends Impurity {
+
+  /**
+   * :: DeveloperApi ::
+   * information calculation for multiclass classification
+   * @param counts Array[Double] with counts for each label
+   * @param totalCount sum of counts for all labels
+   * @return information value, or 0 if totalCount = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(counts: Array[Double], totalCount: Double): 
Double =
+throw new 
UnsupportedOperationException("ApproxBernoulliImpurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * variance calculation
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value, or 0 if count = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(count: Double, sum: Double, sumSquares: Double): 
Double = {
+Variance.calculate(count, sum, sumSquares)
+  }
+}
+
+/**
+ * Class for updating views of a vector of sufficient statistics,
+ * in order to compute impurity from a sample.
+ * Note: Instances of this class do not hold the data; they operate on 
views of the data.
+ */
+private[spark] class ApproxBernoulliAggregator
+  extends ImpurityAggregator(statsSize = 4) with Serializable {
+
+  /**
+   * Update stats for one (node, feature, bin) with the given label.
+   * @param allStats  Flat stats array, with stats for this (node, 
feature, bin) contiguous.
+   * @param offsetStart index of stats for this (node, feature, bin).
+   */
+  def update(allStats: Array[Double], offset: Int, label: Double, 
instanceWeight: Double): Unit = {
+allStats(offset) += instanceWeight
+allStats(offset + 1) += instanceWeight * label
+allStats(offset + 2) += instanceWeight * label * label
+allStats(offset + 3) += instanceWeight * Math.abs(label)
+  }
+
+  /**
+   * Get an [[ImpurityCalculator]] for a (node, feature, bin).
+   * @param allStats  Flat stats array, with stats for this (node, 
feature, bin) contiguous.
+   * @param offsetStart index of stats for this (node, feature, bin).
+   */
+  def getCalculator(allStats: Array[Double], offset: Int): 
ApproxBernoulliCalculator = {
+new ApproxBernoulliCalculator(allStats.view(offset, offset + 
statsSize).toArray)
+  }
+}
+
+/**
+ * Stores statistics for one (node, feature, bin) for calculating impurity.
+ * Unlike [[ImpurityAggregator]], this class stores its own data and is 
for a specific
+ * (node, feature, bin).
+ * @param stats  Array of sufficient statistics for a (node, feature, bin).
+ */
+private[spark] class ApproxBernoulliCalculator(stats: Array[Double])
+  extends ImpurityCalculator(stats) {
+
+  require(stats.length == 4,
+s"ApproxBernoulliCalculator requires sufficient statistics array stats 
to be of length 4," +
+  s" but was given array of length ${stats.length}.")
+
+  /**
+

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2017-03-13 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14547#discussion_r105814881
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
 ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Since}
+import org.apache.spark.mllib.tree.impurity._
+
+/**
+ * [[ApproxBernoulliImpurity]] currently uses variance as a (proxy) 
impurity measure
+ * during tree construction. The main purpose of the class is to have an 
alternative
+ * leaf prediction calculation.
+ *
+ * Only data with examples each of weight 1.0 is supported.
+ *
+ * Class for calculating variance during regression.
+ */
+@Since("2.1")
+private[spark] object ApproxBernoulliImpurity extends Impurity {
+
+  /**
+   * :: DeveloperApi ::
+   * information calculation for multiclass classification
+   * @param counts Array[Double] with counts for each label
+   * @param totalCount sum of counts for all labels
+   * @return information value, or 0 if totalCount = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(counts: Array[Double], totalCount: Double): 
Double =
+throw new 
UnsupportedOperationException("ApproxBernoulliImpurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * variance calculation
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value, or 0 if count = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(count: Double, sum: Double, sumSquares: Double): 
Double = {
+Variance.calculate(count, sum, sumSquares)
+  }
+}
+
+/**
+ * Class for updating views of a vector of sufficient statistics,
+ * in order to compute impurity from a sample.
+ * Note: Instances of this class do not hold the data; they operate on 
views of the data.
+ */
+private[spark] class ApproxBernoulliAggregator
+  extends ImpurityAggregator(statsSize = 4) with Serializable {
+
+  /**
+   * Update stats for one (node, feature, bin) with the given label.
+   * @param allStats  Flat stats array, with stats for this (node, 
feature, bin) contiguous.
+   * @param offsetStart index of stats for this (node, feature, bin).
+   */
+  def update(allStats: Array[Double], offset: Int, label: Double, 
instanceWeight: Double): Unit = {
+allStats(offset) += instanceWeight
+allStats(offset + 1) += instanceWeight * label
+allStats(offset + 2) += instanceWeight * label * label
+allStats(offset + 3) += instanceWeight * Math.abs(label)
+  }
+
+  /**
+   * Get an [[ImpurityCalculator]] for a (node, feature, bin).
+   * @param allStats  Flat stats array, with stats for this (node, 
feature, bin) contiguous.
+   * @param offsetStart index of stats for this (node, feature, bin).
+   */
+  def getCalculator(allStats: Array[Double], offset: Int): 
ApproxBernoulliCalculator = {
+new ApproxBernoulliCalculator(allStats.view(offset, offset + 
statsSize).toArray)
+  }
+}
+
+/**
+ * Stores statistics for one (node, feature, bin) for calculating impurity.
+ * Unlike [[ImpurityAggregator]], this class stores its own data and is 
for a specific
+ * (node, feature, bin).
+ * @param stats  Array of sufficient statistics for a (node, feature, bin).
+ */
+private[spark] class ApproxBernoulliCalculator(stats: Array[Double])
+  extends ImpurityCalculator(stats) {
+
+  require(stats.length == 4,
+s"ApproxBernoulliCalculator requires sufficient statistics array stats 
to be of length 4," +
+  s" but was given array of length ${stats.length}.")
+
+  /**
+

[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-06 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/17556

[SPARK-16957][MLlib] Use weighted midpoints for split values.

## What changes were proposed in this pull request?

Use weighted midpoints for split values.
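
Concretely (restating the rule from SPARK-16957): for adjacent distinct 
feature values $x$ and $y$ occurring $c_x$ and $c_y$ times, the candidate 
split moves from the plain midpoint $(x + y) / 2$ to the weighted midpoint 
$(c_x x + c_y y) / (c_x + c_y)$.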

## How was this patch tested?

+ [x] add unit test.
+ [x] modify Split's unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/decision_tree_overflow_and_precision_in_aggregation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17556


commit 45b74930eea787411855fc35a7ad7198b35d577e
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-07T04:02:13Z

TST: add test case

commit c49d3ae7db0e66855b0c896375b11bf51d9ac482
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-07T04:05:36Z

ENH: use weighted midpoints

commit 387eb498054289149706ecd2f88593d008fd074f
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-07T04:13:44Z

BUG: constant feature, outOfIndex

commit 2e68f1efca59772d1e905474c2392ad0d8b413c8
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-07T04:15:09Z

TST: modify split's test case

commit 6a5806f35185596ffda2c88c4879ecaf0be3bda1
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-04-07T04:24:02Z

CLN: move test case







[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-07-31 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Thanks, @yanboliang . Could you give a hand, @srowen ?





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-03 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
There are test failures in pyspark.ml.tests with Python 2.6, but I don't have that environment.





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-04 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
@yanboliang Thanks, yanbo. I am not familiar with Python 2.6, which is too 
outdated.





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-04 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
There are test failures in pyspark.ml.tests with Python 2.6, but I don't have that environment.





[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2017-08-15 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18736
  
Sure, @yanboliang . Thanks for your suggestion. I'll work on it later, 
perhaps next week. Is that OK?





[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-10 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r132618802
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
 
   /** @group setParam */
   @Since("1.2.0")
-  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
+  def setNumFeatures(value: Int): this.type = {
+    hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
--- End diff --

@WeichenXu123  How about adding a `setNumFeatures` method to mllib?





[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2017-08-13 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18736
  
@yanboliang Hi, yanbo. Could you help review the PR? Thanks.





[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r126863072
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -317,7 +318,12 @@ final class OneVsRest @Since("1.4.0") (
     val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity)
     instr.logNumClasses(numClasses)
 
-    val multiclassLabeled = dataset.select($(labelCol), $(featuresCol))
+    val multiclassLabeled = getClassifier match {
+      // SPARK-21306: cache weightCol if necessary
+      case c: HasWeightCol if c.isDefined(c.weightCol) && c.getWeightCol.nonEmpty =>
+        dataset.select($(labelCol), $(featuresCol), c.getWeightCol)
+      case _ => dataset.select($(labelCol), $(featuresCol))
+    }
--- End diff --

Hi, @yanboliang . As @MLnick said, not all classifiers inherit 
HasWeightCol, so it might cause confusion.

In my opinion, `setWeightCol` is an attribute owned by one specific 
classifier, like `setProbabilityCol` and `setRawPredictionCol` for Logistic 
Regression. So I'd suggest that the user configure the classifier itself, 
rather than OneVsRest. Is that OK?
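
For what it's worth, a usage sketch of what I mean (illustrative, not code from this PR):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Configure weightCol on the underlying classifier itself...
val lr = new LogisticRegression()
  .setMaxIter(5)
  .setWeightCol("weight")

// ...and let OneVsRest simply wrap it.
val ovr = new OneVsRest().setClassifier(lr)
```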





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126646511
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -36,7 +36,8 @@ import org.apache.spark.util.collection.OpenHashMap
 /**
  * Base trait for [[StringIndexer]] and [[StringIndexerModel]].
  */
-private[feature] trait StringIndexerBase extends Params with HasInputCol with HasOutputCol {
+private[feature] trait StringIndexerBase extends Params with HasHandleInvalid with HasInputCol
--- End diff --

Perhaps it is better to break the line at `extends`.
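
Something like this (illustrative only, with dummy traits):

```scala
trait HasHandleInvalid; trait HasInputCol; trait HasOutputCol

// Breaking before `extends` keeps the full trait list on its own line.
trait StringIndexerBase
  extends HasHandleInvalid with HasInputCol with HasOutputCol
```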





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126645714
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -3058,26 +3035,37 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
                            "RFormula drops the same category as R when encoding strings.",
                            typeConverter=TypeConverters.toString)
 
+    handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid entries. " +
--- End diff --

If I understand correctly, `handleInvalid` has already been declared in the 
`HasHandleInvalid` interface, right?





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126643882
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -460,16 +460,16 @@ object LinearRegression extends 
DefaultParamsReadable[LinearRegression] {
   val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = 
WeightedLeastSquares.MAX_NUM_FEATURES
 
   /** String name for "auto". */
-  private[regression] val AUTO = "auto"
+  private[regression] val Auto = "auto"
--- End diff --

Is `solver` related to `handleInvalid`? It seems a little confusing.





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126642928
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -36,7 +36,8 @@ import org.apache.spark.sql.types.{DoubleType, 
StructField, StructType}
  */
 @Since("1.4.0")
 final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)
-  extends Model[Bucketizer] with HasInputCol with HasOutputCol with DefaultParamsWritable {
+  extends Model[Bucketizer] with HasHandleInvalid with HasInputCol with HasOutputCol
+    with DefaultParamsWritable {
--- End diff --

How about aligning `with` with `extends`?
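
That is (illustrative only, with dummy types):

```scala
class Model; trait HasHandleInvalid; trait HasInputCol; trait HasOutputCol
trait DefaultParamsWritable

// Aligning each `with` group under `extends` makes the mixin list scannable.
final class Bucketizer
  extends Model
  with HasHandleInvalid with HasInputCol with HasOutputCol
  with DefaultParamsWritable
```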





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-11 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
@srowen @yanboliang Could you help review the PR? Thanks.





[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18556#discussion_r126026388
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
   files: Seq[FileStatus]): Option[StructType] = {
 val libSVMOptions = new LibSVMOptions(options)
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
-      // Infers number of features if the user doesn't specify (a valid) one.
-      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
--- End diff --

Do you mean that the _SUCCESS file on HDFS, if it exists, will be safely ignored 
by Spark?





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
@lins05 Thanks, that's a reasonable suggestion; I will fix it later.





[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18556#discussion_r126023986
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
   files: Seq[FileStatus]): Option[StructType] = {
 val libSVMOptions = new LibSVMOptions(options)
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
-      // Infers number of features if the user doesn't specify (a valid) one.
-      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
-      val path = if (dataFiles.length == 1) {
-        dataFiles.head.getPath.toUri.toString
-      } else if (dataFiles.isEmpty) {
+      if (files.isEmpty) {
         throw new IOException("No input path specified for libsvm data")
-      } else {
-        throw new IOException("Multiple input paths are not supported for libsvm data.")
+      } else if (files.length > 1) {
+        logWarning(
+          "Multiple input paths detected, determining the number of features by going " +
--- End diff --

Good. How about warning only if `numFeatures` is missing? I mean, remove the 
condition `files.length > 1`:

```scala
if (files.isEmpty) {
  throw new IOException("No input path specified for libsvm data")
} else {
  // ... logWarning ...
}
```





[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-17 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18305#discussion_r127873828
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
     val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
     val bcFeaturesStd = instances.context.broadcast(featuresStd)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-      $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+    val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+      multinomial = isMultinomial)(_)
+    val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+      featuresStd(j / numCoefficientSets)
+    } else {
+      0.0
+    }
+
+    val regularization = if (regParamL2 != 0.0) {
+      val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
--- End diff --

It seems that the `regularization` includes the `intercept`, right?

However, the comment in [LogisticRegression.scala: 
1903L](https://github.com/apache/spark/pull/18305/files#diff-3734f1689cb8a80b07974eb93de0795dL1903)
 is:
> // We do not apply regularization to the intercepts
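
To make the indexing concrete, a small illustration (the concrete numbers are invented):

```scala
// With 3 coefficient sets (classes) and 2 features, flat index j maps to
// feature j / numCoefficientSets; indices at or beyond
// numFeatures * numCoefficientSets (the intercepts) get std 0.0.
val numCoefficientSets = 3
val numFeatures = 2
val featuresStd = Array(0.5, 2.0)
val getFeaturesStd = (j: Int) =>
  if (j >= 0 && j < numCoefficientSets * numFeatures) featuresStd(j / numCoefficientSets)
  else 0.0
(0 until numCoefficientSets * (numFeatures + 1)).foreach { j =>
  println(s"coefficient $j -> std ${getFeaturesStd(j)}")
}
```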






[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-17 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18305#discussion_r127874833
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala
 ---
@@ -32,40 +34,45 @@ private[ml] trait DifferentiableRegularization[T] 
extends DiffFunction[T] {
 }
 
 /**
- * A Breeze diff function for computing the L2 regularized loss and gradient of an array of
+ * A Breeze diff function for computing the L2 regularized loss and gradient of a vector of
  * coefficients.
  *
  * @param regParam The magnitude of the regularization.
  * @param shouldApply A function (Int => Boolean) indicating whether a given index should have
  *                    regularization applied to it.
- * @param featuresStd Option indicating whether the regularization should be scaled by the standard
- *                    deviation of the features.
+ * @param applyFeaturesStd Option for a function which maps coefficient index (column major) to the
+ *                         feature standard deviation. If `None`, no standardization is applied.
  */
 private[ml] class L2Regularization(
-    val regParam: Double,
+    override val regParam: Double,
     shouldApply: Int => Boolean,
-    featuresStd: Option[Array[Double]]) extends DifferentiableRegularization[Array[Double]] {
+    applyFeaturesStd: Option[Int => Double]) extends DifferentiableRegularization[Vector] {
 
-  override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
-    var sum = 0.0
-    val gradient = new Array[Double](coefficients.length)
-    coefficients.indices.filter(shouldApply).foreach { j =>
-      val coef = coefficients(j)
-      featuresStd match {
-        case Some(stds) =>
-          val std = stds(j)
-          if (std != 0.0) {
-            val temp = coef / (std * std)
-            sum += coef * temp
-            gradient(j) = regParam * temp
-          } else {
-            0.0
+  override def calculate(coefficients: Vector): (Double, Vector) = {
+    coefficients match {
+      case dv: DenseVector =>
+        var sum = 0.0
+        val gradient = new Array[Double](dv.size)
+        dv.values.indices.filter(shouldApply).foreach { j =>
+          val coef = coefficients(j)
+          applyFeaturesStd match {
+            case Some(getStd) =>
+              val std = getStd(j)
+              if (std != 0.0) {
+                val temp = coef / (std * std)
+                sum += coef * temp
+                gradient(j) = regParam * temp
+              } else {
+                0.0
+              }
+            case None =>
+              sum += coef * coef
+              gradient(j) = coef * regParam
--- End diff --

Trivial, to match `regParam * temp` above, how about using `regParam * 
coef`?





[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-18 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18305#discussion_r127972263
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
     val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
     val bcFeaturesStd = instances.context.broadcast(featuresStd)
-    val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-      $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+    val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+      multinomial = isMultinomial)(_)
+    val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+      featuresStd(j / numCoefficientSets)
+    } else {
+      0.0
+    }
+
+    val regularization = if (regParamL2 != 0.0) {
+      val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
--- End diff --

Thanks for clarifying.





[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-18 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r128158473
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1255,6 +1255,17 @@ def test_output_columns(self):
         output = model.transform(df)
         self.assertEqual(output.columns, ["label", "features", "prediction"])
 
+    def test_support_for_weightCol(self):
--- End diff --

Sure. I'll use DecisionTreeClassifier to test.





[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-26 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r129562237
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1255,6 +1255,24 @@ def test_output_columns(self):
         output = model.transform(df)
         self.assertEqual(output.columns, ["label", "features", "prediction"])
 
+    def test_support_for_weightCol(self):
+        df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8), 1.0),
+                                         (1.0, Vectors.sparse(2, [], []), 1.0),
+                                         (2.0, Vectors.dense(0.5, 0.5), 1.0)],
+                                        ["label", "features", "weight"])
+        # classifier inherits hasWeightCol
+        lr = LogisticRegression(maxIter=5, regParam=0.01)
+        ovr = OneVsRest(classifier=lr, weightCol="weight")
+        self.assertIsNotNone(ovr.fit(df))
+        ovr2 = OneVsRest(classifier=lr).setWeightCol("weight")
--- End diff --

cleaned.





[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-26 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r129562189
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1517,20 +1517,22 @@ class OneVsRest(Estimator, OneVsRestParams, 
MLReadable, MLWritable):
 
 @keyword_only
     def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
-                 classifier=None):
+                 weightCol=None, classifier=None):
--- End diff --

moved.





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should support setWeightCol

2017-07-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
ping @holdenk @yanboliang 





[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-07-26 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18736

[SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF

## What changes were proposed in this pull request?

Add indexOf method for ml.feature.HashingTF.

The PR is a hotfix that stores the mllib hashingTF.
I believe it is better to migrate the native implementation from mllib to ml. I can 
work on it, but I don't know whether to open a new JIRA for the migration or not.
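
A minimal sketch of the hotfix idea (hypothetical wrapper, not the PR code):

```scala
import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}

// Keep an mllib HashingTF around so indexOf can be exposed by delegation.
class HashingTFWithIndexOf(numFeatures: Int) {
  private val hashingTF = new OldHashingTF(numFeatures)

  // Index of the term in the hashed feature vector.
  def indexOf(term: Any): Int = hashingTF.indexOf(term)
}
```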

## How was this patch tested?

+ [x] add a test case.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/add_indexOf_for_ml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18736.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18736


commit f08861d64390f0eb4d7e34de13d78db67e046457
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-26T05:45:37Z

hotfix: store hashingTF to reuse indexOf

commit 174a2d57ebeccba10b8f1435970cde69f5829c79
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-26T06:14:37Z

TST: add test case







[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-03 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18523

[SPARK-21285][ML] VectorAssembler reports the column name of unsupported 
data type

## What changes were proposed in this pull request?
Add the column name to the exception that is raised for an unsupported data type.

## How was this patch tested?
[ ] pass all tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/vectorassembler_add_col

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18523


commit 95dbf6c7b287d0010af9de377ff6b93dec760808
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-04T05:42:07Z

ENH: report the name of missing column







[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-07-04 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
@jkbradley  Might you have time to review the PR? I believe it will be a 
small improvement for prediction. Thanks.





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-04 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125398010
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
+    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
     inputDataTypes.foreach {
-      case _: NumericType | BooleanType =>
-      case t if t.isInstanceOf[VectorUDT] =>
-      case other =>
-        throw new IllegalArgumentException(s"Data type $other is not supported.")
+      case (_, _: NumericType | BooleanType) =>
+      case (_, t) if t.isInstanceOf[VectorUDT] =>
--- End diff --

Good suggestion. fixed.





[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
I don't know how to write a unit test for this PR. Is it necessary?
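
One shape I could try (a sketch, assuming ScalaTest's `intercept` and a DataFrame `df` that has an unsupported string column named "text"):

```scala
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("id", "text"))  // "text" is a StringType column
  .setOutputCol("features")
// The error message should now name the offending column.
val e = intercept[IllegalArgumentException] {
  assembler.transform(df)
}
assert(e.getMessage.contains("text"))
```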





[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-07-04 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/17383





[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
Good idea!





[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-07-05 Thread facaiy
GitHub user facaiy reopened a pull request:

https://github.com/apache/spark/pull/17383

[SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data

## What changes were proposed in this pull request?

DecisionTree should take advantage of sparse feature vectors. Aggregation 
over training data could handle the empty/zero-valued data elements more 
efficiently.
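
A hedged sketch of the idea (not the implementation in this PR): keep the binned values in a SparseVector and visit only the active entries during aggregation.

```scala
import org.apache.spark.ml.linalg.Vectors

// 6 features; only features 1 and 4 have non-zero bins
val binnedFeatures = Vectors.sparse(6, Array(1, 4), Array(3.0, 2.0))
binnedFeatures.foreachActive { (featureIndex, bin) =>
  println(s"feature $featureIndex falls in bin ${bin.toInt}")
}
```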


## How was this patch tested?

Modifying the inner implementation won't change the behavior of the DecisionTree module,
hence all existing unit tests should pass.

Some performance benchmarks are perhaps needed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17383


commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-22T05:45:58Z

CLN: use Vector to implement binnedFeatures in TreePoint

commit 9ce6b813beffb9d58e7b2907425a1262610256be
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-22T09:15:30Z

BUG: fix for incompatible argument of predictImpl method

commit 37f05f9b0386acc8bea048e72aff2b9c37ca4ca6
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-22T09:18:04Z

CLN: create sparse vector when converting to TreePoint

commit c9664ce6c94b98cbc76253817e637d9a968e4bd6
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-22T09:21:59Z

CLN: change Array to Vector in TreePoint when created

commit d6ef9e512ea4a58db2dccf3e7cca95f9e8b0df8f
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-23T02:12:22Z

PREP: use Vector[Int] to store binnedFeature

commit 59eb779a9d4f711e7b28d31d579cc49e3d3cc370
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-23T03:50:14Z

CLN: change binnedFeatures from def to val

commit 9cbe577b408e987f3026d01316f5a7f2d4c5cfb2
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-28T00:57:42Z

CLN: use filter to select non-zero bits

commit b5b0dc8683b6e2d7d274aa8d39932dec61e6193d
Author: 颜发才(Yan Facai) <facai@gmail.com>
Date:   2017-03-28T01:03:55Z

BUG: fix, compile fails

commit cf7e3d8e03f73df725336d0d5a9dd6cc16e7bf95
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-05T05:42:09Z

Merge branch 'master' into ENH/use_sparsity_in_decision_tree

commit 032d50d8c8a851671ba2754cec817d0f6e9ae70f
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-05T06:20:38Z

CLN: use BSV in predictImpl

commit 257ddf773eb47499962d6cc57fd1323324dd4ab8
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-05T06:42:24Z

ENH: create subclass TreeSparsePoint

commit 8a919735f9474283d263df78feb2e176f66917f3
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-05T06:58:54Z

ENH: use TreeDensePoint when numFeatures < 1







[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125584572
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
+    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
     inputDataTypes.foreach {
-      case _: NumericType | BooleanType =>
-      case t if t.isInstanceOf[VectorUDT] =>
-      case other =>
-        throw new IllegalArgumentException(s"Data type $other is not supported.")
+      case (_, _: NumericType | BooleanType) =>
+      case (_, t) if t.isInstanceOf[VectorUDT] =>
--- End diff --

Hey, it throws a "not found" exception if `t: VectorUDT` is used. Do you 
have any idea why?





[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...

2017-07-06 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18554

[SPARK-21306][ML] OneVsRest should cache weightCol if necessary

## What changes were proposed in this pull request?

cache weightCol if classifier inherits HasWeightCol trait.

## How was this patch tested?

+ [x] add an unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark BUG/oneVsRest_missing_weightCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18554.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18554


commit f1785959e04e462877fd74e6c767e47e0fb207d6
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-06T09:38:01Z

BUG: cache weightCol if necessary

commit a3e6e967a23f06dbd50aee08c34b07055c5646ee
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-06T09:55:12Z

TST: add unit test







[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125860650
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
-    inputDataTypes.foreach {
-      case _: NumericType | BooleanType =>
-      case t if t.isInstanceOf[VectorUDT] =>
-      case other =>
-        throw new IllegalArgumentException(s"Data type $other is not supported.")
+    val incorrectColumns = inputColNames.flatMap { name =>
+      schema(name).dataType match {
+        case _: NumericType | BooleanType => None
+        case t if t.isInstanceOf[VectorUDT] => None
+        case other => Some(s"Data type $other of column $name is not supported.")
+      }
+    }
+    if (!incorrectColumns.isEmpty) {
--- End diff --

Yes, modified.





[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
@SparkQA Jenkins, run tests again, please.





[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
@SparkQA test again, please.





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
I'm not familiar with R; I used grep to search for "OneVsRest" and got 
nothing. Hence it seems nothing needs to be done on the R side.





[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18556#discussion_r126050849
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
   files: Seq[FileStatus]): Option[StructType] = {
 val libSVMOptions = new LibSVMOptions(options)
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
-      // Infers number of features if the user doesn't specify (a valid) one.
-      val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
-      val path = if (dataFiles.length == 1) {
-        dataFiles.head.getPath.toUri.toString
-      } else if (dataFiles.isEmpty) {
+      if (files.isEmpty) {
         throw new IOException("No input path specified for libsvm data")
--- End diff --

In my opinion, it is safe and necessary to validate the parameter in advance. 
Perhaps `IllegalArgumentException` is more suitable.
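
For instance (illustrative):

```scala
// require throws IllegalArgumentException when the predicate is false
require(files.nonEmpty, "No input path specified for libsvm data")
```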





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125763918
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
-    inputDataTypes.foreach {
-      case _: NumericType | BooleanType =>
-      case t if t.isInstanceOf[VectorUDT] =>
-      case other =>
-        throw new IllegalArgumentException(s"Data type $other is not supported.")
+    val incorrectColumns = inputColNames.flatMap { name =>
--- End diff --

Hi, yanbo. Using `flatMap` is aimed at unpacking the values from the Options. Is it OK?





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-04 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125539040
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-    val inputDataTypes = inputColNames.map(name => schema(name).dataType)
+    val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType))
     inputDataTypes.foreach {
-      case _: NumericType | BooleanType =>
-      case t if t.isInstanceOf[VectorUDT] =>
-      case other =>
-        throw new IllegalArgumentException(s"Data type $other is not supported.")
+      case (_, _: NumericType | BooleanType) =>
+      case (_, _: VectorUDT) =>
+      case (name, other) =>
+        throw new IllegalArgumentException(s"Data type $other of column $name is not supported.")
--- End diff --

Fine, modified.





[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-04-25 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
@srowen  I am not sure whether I understand your question correctly. 
RandomForest uses LearningNode to construct the tree model during training, and 
converts the nodes to Leaf or InternalNode at the end. Hence, all nodes are the 
same type during training and can be merged then.

However, if two children of a node output the same prediction, should the node 
collapse into a single leaf like its children? I don't know.





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-25 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
Fixed the failed case; please retest.





[GitHub] spark pull request #17503: [SPARK-3159][MLlib] Check for reducible DecisionT...

2017-04-25 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17503#discussion_r113360409
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -61,6 +61,8 @@ import org.apache.spark.mllib.tree.impurity.{Entropy, 
Gini, Impurity, Variance}
  * @param subsamplingRate Fraction of the training data used for learning 
decision tree.
  * @param useNodeIdCache If this is true, instead of passing trees to 
executors, the algorithm will
  *   maintain a separate RDD of node Id cache for each 
row.
+ * @param canMergeChildren Merge pairs of leaf nodes of the same parent 
which
--- End diff --

A new parameter is added to the Strategy class, which fails the MiMa tests. How 
should I deal with it?

```bash
[error]  * synthetic method $default$13()Int in object 
org.apache.spark.mllib.tree.configuration.Strategy has a different result type 
in current version, where it is Boolean rather than Int
```
[see failed 
logs](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3675/consoleFull)





[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-04-27 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
I have the same question as you. I guess the impurity info is useful for 
debugging and analyzing the tree model. However, as the tree is grown from root 
to leaves during training, it seems needless to merge its children.





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r114043563
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
   // sort distinct values
   val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
 
-      // if possible splits is not enough or just enough, just return all possible splits
+      def weightedMean(pre: (Double, Int), cur: (Double, Int)): Double = {
+        val (preValue, preCount) = pre
+        val (curValue, curCount) = cur
+        (preValue * preCount + curValue * curCount) / (preCount.toDouble + curCount)
+      }
+
       val possibleSplits = valueCounts.length - 1
-      if (possibleSplits <= numSplits) {
-        valueCounts.map(_._1).init
+      if (possibleSplits == 0) {
+        // constant feature
+        Array.empty[Double]
+
--- End diff --

removed.





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r114043568
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -138,9 +169,10 @@ class RandomForestSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Array(2), Gini, QuantileStrategy.Sort,
 0, 0, 0.0, 0, 0
   )
-  val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
+  val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
   val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
-  assert(splits === Array(1.0))
+  val expSplits = Array((1.0 * 1 + 2.0 * 15) / (1 + 15))  // = (1.9375)
--- End diff --

renamed.





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r114043558
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1037,7 +1051,10 @@ private[spark] object RandomForest extends Logging {
   // makes the gap between currentCount and targetCount smaller,
   // previous value is a split threshold.
   if (previousGap < currentGap) {
-    splitsBuilder += valueCounts(index - 1)._1
+    val pre = valueCounts(index - 1)
+    val cur = valueCounts(index)
+
--- End diff --

removed





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-28 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
For a (training) sample drawn from a continuous distribution, say {x0, x1, x2, 
x3, ..., x100}, Spark currently selects quantiles as split points.

Suppose 10-quantiles are used, x2 is the 1st quantile, and x10 is the 2nd 
quantile. The assumption is that P(x < x2) ~= P(x2 < x < x10). However, x2 is 
not perfect: as the data is continuous, there exists one point z that really 
satisfies P(x < z) == P(z < x < x10).

And it seems reasonable, in my opinion, that an averaged midpoint between x2 
and x3 is more appropriate.
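
A small worked example of the weighted midpoint (mirroring the `weightedMean` helper from the diff in this PR; the sample counts are only illustrative):

```scala
// Weighted midpoint of two adjacent distinct values, weighted by their counts.
def weightedMean(pre: (Double, Int), cur: (Double, Int)): Double = {
  val (preValue, preCount) = pre
  val (curValue, curCount) = cur
  (preValue * preCount + curValue * curCount) / (preCount.toDouble + curCount)
}

// For samples {1, 1, 2, 2, 2, 2, 2, 2, 2, 2}: value 1.0 occurs 2 times and
// value 2.0 occurs 8 times, so the threshold lands at (1*2 + 2*8) / 10 = 1.8,
// pulled toward the denser side, rather than at the plain midpoint 1.5.
val split = weightedMean((1.0, 2), (2.0, 8))  // 1.8
```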





[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...

2017-04-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17556#discussion_r114043439
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -112,9 +138,11 @@ class RandomForestSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Array(5), Gini, QuantileStrategy.Sort,
 0, 0, 0.0, 0, 0
   )
-  val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
+  val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3).map(_.toDouble)
   val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
-  assert(splits === Array(1.0, 2.0))
+  val expSplits = Array((1.0 * 2 + 2.0 * 8) / (2 + 8),
+    (2.0 * 8 + 3.0 * 2) / (8 + 2))  // = (1.8, 2.2)
--- End diff --

It may help code reviewers and maintainers to get the idea clearly.





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-28 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
By the way, it's safe to use the mean value, as it matches the other libraries. 
If requested, I'd be happy to modify the PR.
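
For comparison, the unweighted alternative that the other libraries use (a minimal sketch, with the same illustrative counts as above):

```scala
// Plain midpoint, as used by e.g. scikit-learn's tree splitter: the counts
// of the two adjacent values are ignored entirely.
def mean(preValue: Double, curValue: Double): Double = (preValue + curValue) / 2.0

val split = mean(1.0, 2.0)  // 1.5, regardless of how often each value occurs
```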





[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...

2017-04-30 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17556
  
OK, the weights have been removed from the calculation.





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-08-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18764#discussion_r131529693
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1344,7 +1346,19 @@ def _fit(self, dataset):
 
         numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
 
-        multiclassLabeled = dataset.select(labelCol, featuresCol)
+        weightCol = None
+        if (self.isDefined(self.weightCol) and self.getWeightCol()):
+            if isinstance(classifier, HasWeightCol):
+                weightCol = self.getWeightCol()
+            else:
+                warnings.warn("weightCol is ignored, "
+                              "as it is not supported by {} now.".format(
--- End diff --

Thank you very much for help, @yanboliang !





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-08-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r131529768
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1423,7 +1425,18 @@ def _fit(self, dataset):
 
         numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
 
-        multiclassLabeled = dataset.select(labelCol, featuresCol)
+        weightCol = None
+        if (self.isDefined(self.weightCol) and self.getWeightCol()):
+            if isinstance(classifier, HasWeightCol):
+                weightCol = self.getWeightCol()
+            else:
+                warnings.warn("weightCol is ignored, "
+                              "as it is not supported by {} now.".format(classifier))
--- End diff --

Modified as suggested, thanks very much!





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18764

[SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol

The PR is related to #18554, and is modified for branch 2.0.

## What changes were proposed in this pull request?

Add a `setWeightCol` method for OneVsRest.

`weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.
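
A minimal usage sketch on the Scala side (assuming this patch is applied; the data and column names are purely illustrative):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("ovr-weight").getOrCreate()
import spark.implicits._

// A toy dataset with an explicit weight column (must be numeric).
val dataset = Seq(
  (0.0, Vectors.dense(0.0, 1.0), 1.0),
  (1.0, Vectors.dense(1.0, 0.0), 2.0)
).toDF("label", "features", "weight")

val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression())  // LogisticRegression mixes in HasWeightCol
  .setWeightCol("weight")  // forwarded to the classifier; ignored with a warning otherwise
val model = ovr.fit(dataset)
```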

## How was this patch tested?

+ [x] add a unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark 
BUG/branch-2.0_OneVsRest_support_setWeightCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18764.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18764


commit 0c60c28ca6a3da84401f4947a21ace6980be08fa
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-28T02:10:35Z

[SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol







[GitHub] spark pull request #18763: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-28 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18763

 [SPARK-21306][ML] OneVsRest should support setWeightCol for branch-2.1

The PR is related to #18554, and is modified for branch 2.1.

## What changes were proposed in this pull request?

Add a `setWeightCol` method for OneVsRest.

`weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.

## How was this patch tested?

+ [x] add a unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark 
BUG/branch-2.1_OneVsRest_support_setWeightCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18763


commit ddb83da765f38546a96a63b26478f594601c7e2b
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-07-28T02:10:35Z

[SPARK-21306][ML] OneVsRest should support setWeightCol







[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r130202540
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
   }
 
   test("SPARK-21306: OneVsRest should support setWeightCol") {
-    val dataset2 = dataset.withColumn("weight", lit(1))
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

@yanboliang @srowen  My mistake. It would be better to use the double value 1.0 
in master / branch-2.2 / branch-2.3 in the same way.





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18764#discussion_r130200288
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -33,6 +33,7 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.mllib.util.TestingUtils._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
--- End diff --

add missing import for branch 2.0





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18764#discussion_r130200379
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -143,6 +144,16 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(output.schema.fieldNames.toSet === Set("label", "features", 
"prediction"))
   }
 
+  test("SPARK-21306: OneVsRest should support setWeightCol") {
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

use double value, `lit(1.0)`
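
For context on why the literal's type matters: `lit(1)` produces an IntegerType column while `lit(1.0)` produces DoubleType, presumably tripping up the weight handling in these branches, which consumes the column as a double. A minimal sketch (`dataset` stands for the suite's test DataFrame):

```scala
import org.apache.spark.sql.functions.lit

// lit(1) yields an IntegerType column; lit(1.0) yields DoubleType. Since the
// weight column is read as a double here, the integer literal can break the
// test in the older branches, hence this one-character fix.
val intWeights = dataset.withColumn("weight", lit(1))       // IntegerType: breaks
val doubleWeights = dataset.withColumn("weight", lit(1.0))  // DoubleType: works
```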





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch-2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r130200461
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -157,6 +157,16 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(output.schema.fieldNames.toSet === Set("label", "features", 
"prediction"))
   }
 
+  test("SPARK-21306: OneVsRest should support setWeightCol") {
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

use double value, `lit(1.0)`





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r130213337
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
   }
 
   test("SPARK-21306: OneVsRest should support setWeightCol") {
-    val dataset2 = dataset.withColumn("weight", lit(1))
+    val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

OK. Could you test and merge the PR into branch 2.1? 





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-08 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Sure, thanks, @yanboliang !





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-08-08 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18764





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-08-08 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18763





[GitHub] spark issue #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest should suppo...

2017-08-08 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18763
  
Thanks, all.




