[GitHub] spark pull request #17407: [SPARK-20043][ML][WIP] DecisionTreeModel can't re...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/17407

[SPARK-20043][ML][WIP] DecisionTreeModel can't recognize Impurity "Gini" when loading

Fix bug: DecisionTreeModel can't recognize Impurity "Gini" when loading.

TODO:
+ [ ] add unit test
+ [ ] fix the bug

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark BUG/decision_tree_loader_failer_with_Gini_impurity

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17407.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17407

commit 2f5477f525319fc323cb63e42337c876592788c1
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-24T04:18:39Z

    TST: add testsuite

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
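The bug is a case mismatch: the Param holds a capitalized value such as "Gini", while the loader matches lowercase names. A minimal sketch of the idea behind the fix, with hypothetical names (the real `ImpurityCalculator.getCalculator` in Spark differs): normalize the impurity string before dispatching.

```scala
// Hypothetical sketch: normalizing the impurity name before matching, so that
// "Gini", "gini" and "GINI" all resolve to the same calculator name.
object ImpurityLookupSketch {
  def resolve(impurity: String): String = impurity.toLowerCase match {
    case "gini"     => "gini"
    case "entropy"  => "entropy"
    case "variance" => "variance"
    case other => throw new IllegalArgumentException(
      s"ImpurityCalculator builder did not recognize impurity type: $other")
  }
}
```

With this, `ImpurityLookupSketch.resolve("Gini")` returns `"gini"` instead of throwing, which is the behavior the test below exercises.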
[GitHub] spark issue #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator ...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17407

@jkbradley Hi, the tests passed. Is it good enough to be merged? By the way, String Params are fragile when saving/loading a model, since the setParam and getParam methods are of no use in that case.
[GitHub] spark issue #17407: [SPARK-20043][ML] DecisionTreeModel can't recognize Imp...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17407

@jkbradley Thanks. I agree with your advice. Modifying the value is a little aggressive, while changing ImpurityCalculator.getCalculator is more moderate. However, I'm afraid similar bugs caused by the mismatch won't go away unless we fix it at the source. By the way, some unit tests are broken, though I had tested them before pushing. Perhaps something is wrong with my IDE; I'll take a look next Monday.
[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17383#discussion_r108318833

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
@@ -301,7 +302,7 @@ private[tree] class LearningNode(
    * group of nodes on one call to
    * [[org.apache.spark.ml.tree.impl.RandomForest.findBestSplits()]].
    */
-  def predictImpl(binnedFeatures: Array[Int], splits: Array[Array[Split]]): Int = {
+  def predictImpl(binnedFeatures: Int => Int, splits: Array[Array[Split]]): Int = {
--- End diff --

When using `breeze.linalg.SparseVector`, compilation fails:
```scala
java.lang.AssertionError: assertion failed: List(method apply$mcI$sp, method apply$mcI$sp)
at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1947)
at scala.tools.nsc.transform.SpecializeTypes$$anon$2.matchingSymbolInPrefix$1(SpecializeTypes.scala:1474)
at scala.tools.nsc.transform.SpecializeTypes$$anon$2.transformSelect$1(SpecializeTypes.scala:1497)
at scala.tools.nsc.transform.SpecializeTypes$$anon$2.transform1(SpecializeTypes.scala:1593)
at scala.tools.nsc.transform.SpecializeTypes$$anon$2$$anonfun$transform$3.apply(SpecializeTypes.scala:1442)
..
```
I don't know why. Can anyone help? Thanks.
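The signature change in the diff above (from `Array[Int]` to `Int => Int`) lets the same traversal code read binned features from either a dense array or a sparse structure. A small illustrative sketch of that idea, not the Spark implementation:

```scala
// Sketch: an Int => Int accessor abstracts over how binned features are stored.
val dense: Array[Int] = Array(0, 2, 1)
val denseAccess: Int => Int = dense.apply

// A hypothetical sparse form: only non-zero bins are stored, zeros elsewhere.
val sparseBins = Map(1 -> 2, 2 -> 1)
val sparseAccess: Int => Int = i => sparseBins.getOrElse(i, 0)

// Traversal code only needs the accessor, not the concrete container.
def binOf(binnedFeatures: Int => Int, featureIndex: Int): Int =
  binnedFeatures(featureIndex)

assert(binOf(denseAccess, 1) == binOf(sparseAccess, 1))
```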
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalc...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108320537

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SPARK-20043: " +
+    "ImpurityCalculator builder fails for uppercase impurity type Gini in model read/write") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+
+    val data: DataFrame =
+      TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2)
+
+    val dt = new DecisionTreeClassifier()
+      .setImpurity("Gini")
+      .setMaxDepth(2)
+
+    val model = dt.fit(data)
+
+    testDefaultReadWrite(model, false)
--- End diff --

done.
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalc...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108320481

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
@@ -178,6 +178,22 @@ class DecisionTreeRegressorSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       TreeTests.allParamSettings ++ Map("maxDepth" -> 0),
       TreeTests.allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SPARK-20043: " +
+    "ImpurityCalculator builder fails for uppercase impurity type in model read/write") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+
+    val data: DataFrame =
+      TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 0)
+
+    val dt = new DecisionTreeRegressor()
+      .setImpurity("Variance")
+      .setMaxDepth(2)
+
+    val model = dt.fit(data)
+
--- End diff --

removed.
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalc...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108320525

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SPARK-20043: " +
+    "ImpurityCalculator builder fails for uppercase impurity type Gini in model read/write") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+
+    val data: DataFrame =
+      TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2)
+
+    val dt = new DecisionTreeClassifier()
+      .setImpurity("Gini")
+      .setMaxDepth(2)
+
+    val model = dt.fit(data)
+
--- End diff --

Deleted some blank lines to keep it compact.
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recongn...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108050101

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity type: Gini") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+
+    val categoricalData: DataFrame =
+      TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
--- End diff --

done.
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recongn...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108050103

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity type: Gini") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+
+    val categoricalData: DataFrame =
+      TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
+
+    // BUG: see SPARK-20043
--- End diff --

removed the comment.
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recongn...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108050099

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity type: Gini") {
--- End diff --

done.
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recongn...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108050110

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("read/write: ImpurityCalculator builder did not recognize impurity type: Gini") {
+    val rdd = TreeTests.getTreeReadWriteData(sc)
+
+    val categoricalData: DataFrame =
+      TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
+
+    // BUG: see SPARK-20043
+    val dt = new DecisionTreeClassifier().setImpurity("Gini")
--- End diff --

set maxDepth = 2.
[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel can't recongn...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17407#discussion_r108050269

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite
     testEstimatorAndModelReadWrite(dt, continuousData,
       allParamSettings ++ Map("maxDepth" -> 0),
       allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
   }
+
+  test("SAPRK-20043: " +
--- End diff --

Oops, a copy-and-paste slip. Thanks for the correction.
[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r107077783

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Since}
+import org.apache.spark.mllib.tree.impurity._
+
+/**
+ * [[ApproxBernoulliImpurity]] currently uses variance as a (proxy) impurity measure
+ * during tree construction. The main purpose of the class is to have an alternative
+ * leaf prediction calculation.
+ *
+ * Only data with examples each of weight 1.0 is supported.
+ *
+ * Class for calculating variance during regression.
+ */
+@Since("2.1")
+private[spark] object ApproxBernoulliImpurity extends Impurity {
+
+  /**
+   * :: DeveloperApi ::
+   * information calculation for multiclass classification
+   * @param counts Array[Double] with counts for each label
+   * @param totalCount sum of counts for all labels
+   * @return information value, or 0 if totalCount = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(counts: Array[Double], totalCount: Double): Double =
+    throw new UnsupportedOperationException("ApproxBernoulliImpurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * variance calculation
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value, or 0 if count = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(count: Double, sum: Double, sumSquares: Double): Double = {
+    Variance.calculate(count, sum, sumSquares)
+  }
+}
+
+/**
+ * Class for updating views of a vector of sufficient statistics,
+ * in order to compute impurity from a sample.
+ * Note: Instances of this class do not hold the data; they operate on views of the data.
+ */
+private[spark] class ApproxBernoulliAggregator
+  extends ImpurityAggregator(statsSize = 4) with Serializable {
+
+  /**
+   * Update stats for one (node, feature, bin) with the given label.
+   * @param allStats Flat stats array, with stats for this (node, feature, bin) contiguous.
+   * @param offset Start index of stats for this (node, feature, bin).
+   */
+  def update(allStats: Array[Double], offset: Int, label: Double, instanceWeight: Double): Unit = {
+    allStats(offset) += instanceWeight
+    allStats(offset + 1) += instanceWeight * label
+    allStats(offset + 2) += instanceWeight * label * label
+    allStats(offset + 3) += instanceWeight * Math.abs(label)
+  }
+
+  /**
+   * Get an [[ImpurityCalculator]] for a (node, feature, bin).
+   * @param allStats Flat stats array, with stats for this (node, feature, bin) contiguous.
+   * @param offset Start index of stats for this (node, feature, bin).
+   */
+  def getCalculator(allStats: Array[Double], offset: Int): ApproxBernoulliCalculator = {
+    new ApproxBernoulliCalculator(allStats.view(offset, offset + statsSize).toArray)
+  }
+}
+
+/**
+ * Stores statistics for one (node, feature, bin) for calculating impurity.
+ * Unlike [[ImpurityAggregator]], this class stores its own data and is for a specific
+ * (node, feature, bin).
+ * @param stats Array of sufficient statistics for a (node, feature, bin).
+ */
+private[spark] class ApproxBernoulliCalculator(stats: Array[Double])
+  extends ImpurityCalculator(stats) {
+
+  require(stats.length == 4,
+    s"ApproxBernoulliCalculator requires sufficient statistics array stats to be of length 4," +
+      s" but was given array of length ${stats.length}.")
+
+  /**
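As a sanity check on the four sufficient statistics the aggregator above accumulates (weighted count, weighted label sum, weighted sum of squares, weighted sum of absolute values), here is a standalone sketch of the same update rule, not the Spark class itself:

```scala
// Standalone sketch of the ApproxBernoulliAggregator update rule:
// stats = (count, sum, sumSquares, sumAbs), all instance-weighted.
val stats = new Array[Double](4)

def update(label: Double, weight: Double): Unit = {
  stats(0) += weight                     // weighted count
  stats(1) += weight * label             // weighted label sum
  stats(2) += weight * label * label     // weighted sum of squares
  stats(3) += weight * math.abs(label)   // weighted sum of |label|
}

update(label = -1.0, weight = 1.0)
update(label = 1.0, weight = 1.0)
// stats is now (2.0, 0.0, 2.0, 2.0): the extra fourth statistic is what
// distinguishes this aggregator from the plain variance one.
```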
[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/17383

[SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data

## What changes were proposed in this pull request?

DecisionTree should take advantage of sparse feature vectors. Aggregation over training data could handle empty/zero-valued data elements more efficiently.

## How was this patch tested?

Modifying the inner implementation won't change the behavior of the DecisionTree module, hence all existing unit tests should pass. Some performance benchmarks are perhaps needed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17383.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17383

commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-22T05:45:58Z

    CLN: use Vector to implement binnedFeatures in TreePoint
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503

@jkbradley @hhbyyh Could you review the PR? Thanks.
[GitHub] spark pull request #17503: [SPARK-3159][MLlib] Check for reducible DecisionT...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/17503

[SPARK-3159][MLlib] Check for reducible DecisionTree

Add a canMergeChildren param: find pairs of leaves under the same parent that output the same prediction, and merge them.

## How was this patch tested?

1. [x] add unit test: verify whether the implementation is correct.
2. [ ] add unit test: verify whether setCanMergeChildren works.
3. [ ] perhaps we need to create a sample which can produce a reducible tree, and test it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark CLN/check_for_reducible_decision_tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17503.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17503

commit fab2a0e5a3c4db8beeaa78d98253d11e408f3b56
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-31T02:33:38Z
    TST: create new test suite

commit f5d52cce500290165ac7d8bad5aa38041ed21c54
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-31T02:35:26Z
    TST: helper method for construcing binary tree

commit b9248b7ae2e1048d93e44e2d3687c2f9fd286ce8
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-31T07:09:51Z
    TST: helper method, show tree node info

commit be12f4f23a5fd53870bc97b6cbdb8fa0a094f2c1
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-31T07:21:41Z
    TST: helper method, check if pairs of leave with same prediction exists

commit b52420201576610613c782146dc9d6c2dc6ebb0c
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-31T07:28:28Z
    TST: helper method for modifying nodes

commit 98a73f952d1a199cf581cde2636d6dc831ae4ee3
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-03-31T07:41:46Z
    ENH: merge the pairs of leave with same prediction of same parent

commit 632325d0e0d45d7fe9325686f90dbdc64b149960
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-01T01:10:07Z
    ENH: add mergeLeave param in Strategy

commit 12052958d30d015be537fbd1169da4406869fb3d
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-01T01:18:50Z
    ENH: support mergeChild when training

commit 434c762de76be2f1b4ec939ccba9c2ecb45c1c04
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-01T02:48:37Z
    ENH: add canMergeChildren param in DecisionTreeParams

commit 5162552a8db92283a514e94adeef439f6fb8f80e
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-01T02:54:57Z
    ENH: add set method in tree classifier

commit 21b1a851c89cd0a060720503bdc4a9441155236b
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-01T04:19:07Z
    ENH: stat: merge counts of each tree

commit 25b712a37bbc31cc9b8ff2b6330d79fd437cb17c
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-01T06:16:01Z
    BUG: depth=0 tree has none of children
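The merge described above (collapsing sibling leaves that output the same prediction) can be sketched on a toy tree type. This is an illustration of the idea only, not the PR's implementation:

```scala
// Toy tree: if both children of an internal node reduce to leaves with the
// same prediction, the parent itself becomes a leaf with that prediction.
sealed trait Node
final case class Leaf(prediction: Double) extends Node
final case class Internal(left: Node, right: Node, prediction: Double) extends Node

def mergeChildren(node: Node): Node = node match {
  case Internal(l, r, p) =>
    (mergeChildren(l), mergeChildren(r)) match {
      case (Leaf(a), Leaf(b)) if a == b => Leaf(a)          // reducible pair
      case (ml, mr)                     => Internal(ml, mr, p)
    }
  case leaf => leaf
}
```

Working bottom-up means a merge at one level can expose another reducible pair at the level above, which is why the recursion runs before the pattern match on the children.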
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r110385244

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
     // sort distinct values
     val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
+    def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
+      val (preValue, preCount) = pre
+      val (curValue, curCount) = cru
+      (preValue * preCount + curValue * curCount) / (preCount + curCount)
--- End diff --

I agree with you. Fixed.
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r110385132

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
     // sort distinct values
     val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
+    def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
--- End diff --

Fixed.
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r110385526

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
     // sort distinct values
     val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray
+    def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
+      val (preValue, preCount) = pre
+      val (curValue, curCount) = cru
+      (preValue * preCount + curValue * curCount) / (preCount + curCount)
+    }
+
     // if possible splits is not enough or just enough, just return all possible splits
     val possibleSplits = valueCounts.length - 1
-    if (possibleSplits <= numSplits) {
-      valueCounts.map(_._1).init
+    if (possibleSplits == 0) {
+      // constant feature
+      Array.empty[Double]
+
+    } else if (possibleSplits <= numSplits) {
+      valueCounts
+        .sliding(2)
+        .map { x => weightedMean(x(0), x(1)) }
--- End diff --

Fixed. Do you mean use `scanLeft`? It's a little complicated and obscure.
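The `weightedMean` in the diff above computes count-weighted midpoints between adjacent distinct feature values: the `sliding(2)` pass turns n distinct values into n - 1 candidate split thresholds. A self-contained sketch of that computation (mirroring the diff, not necessarily the final merged code):

```scala
// Count-weighted mean of two adjacent (value, count) pairs.
def weightedMean(prev: (Double, Int), cur: (Double, Int)): Double = {
  val (prevValue, prevCount) = prev
  val (curValue, curCount) = cur
  (prevValue * prevCount + curValue * curCount) / (prevCount + curCount)
}

// Hypothetical sorted (distinct value, count) pairs for one continuous feature.
val valueCounts = Array((1.0, 3), (2.0, 1), (4.0, 2))

// Adjacent pairs become candidate split thresholds:
val splits = valueCounts.sliding(2).map(x => weightedMean(x(0), x(1))).toArray
// (1.0 * 3 + 2.0 * 1) / 4 = 1.25 and (2.0 * 1 + 4.0 * 2) / 3 = 10.0 / 3
```

Unlike the unweighted midpoint, a threshold lands closer to whichever neighboring value carries more samples, which is the point of the change.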
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r110386358

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -996,7 +996,7 @@ private[spark] object RandomForest extends Logging {
     require(metadata.isContinuous(featureIndex),
       "findSplitsForContinuousFeature can only be used to find splits for a continuous feature.")
-    val splits = if (featureSamples.isEmpty) {
+    val splits: Array[Double] = if (featureSamples.isEmpty) {
--- End diff --

The code block is too long and has 4 exits. Annotating its type explicitly is perhaps easier to understand, even though the type of `splits` is implied by the return type.
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

Is there something wrong with the Spark CI?
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

```
Test Result (1 failure / +1)

org.apache.spark.storage.TopologyAwareBlockReplicationPolicyBehavior.Peers in 2 racks
```

Does anyone know what this is?
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

@srowen Hi, I forgot the unit tests in Python and R. Where can I find documentation about setting up a development environment? Thanks.
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

I have run all the MLlib unit test cases in Python. However, I am not familiar with R, and I don't want to waste too much time on deploying an R environment. Could the CI retest the PR? Then we can check whether any unit tests are still broken. Thanks.
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r111656240

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---

```scala
@@ -126,9 +138,10 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
         Array(3), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0
       )
-      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5).map(_.toDouble)
+      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5)
+        .map(_.toDouble)
       val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
-      assert(splits === Array(2.0, 3.0))
+      assert(splits === Array(2.0625, 3.5))
```

--- End diff --

Done.
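The updated expectations above are consistent with count-weighted midpoints of adjacent distinct values: the new sample array holds fifteen 2s and one each of 3, 4, and 5. A quick arithmetic check (not code from the PR; which adjacent pairs are chosen depends on the stride-based selection inside `findSplitsForContinuousFeature`):

```python
# Verify that the expected splits are weighted midpoints of adjacent
# distinct values, using the counts from the test's sample array.
counts = {2.0: 15, 3.0: 1, 4.0: 1, 5.0: 1}

# midpoint between the runs of 2s and 3s, weighted by their counts
split_a = (2.0 * counts[2.0] + 3.0 * counts[3.0]) / (counts[2.0] + counts[3.0])
# midpoint between 3 and 4, each appearing once, so a plain average
split_b = (3.0 * counts[3.0] + 4.0 * counts[4.0]) / (counts[3.0] + counts[4.0])
# split_a is 33/16 = 2.0625 and split_b is 3.5, the values asserted above
```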
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r111656235

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---

```scala
@@ -112,9 +124,9 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
         Array(5), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0
       )
-      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
+      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3).map(_.toDouble)
       val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
-      assert(splits === Array(1.0, 2.0))
+      assert(splits === Array(1.8, 2.2))
```

--- End diff --

Done.
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r111656245

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---

```scala
@@ -104,6 +104,18 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
       assert(splits.distinct.length === splits.length)
     }

+    // SPARK-16957: Use weighted midpoints for split values.
+    {
+      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
+        Map(), Set(),
+        Array(2), Gini, QuantileStrategy.Sort,
+        0, 0, 0.0, 0, 0
+      )
+      val featureSamples = Array(0, 1, 0, 0, 1, 0, 1, 1).map(_.toDouble)
+      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
+      assert(splits === Array(0.5))
+    }
```

--- End diff --

Added a new case.
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

@sethah Perhaps it's hard to compare R with Spark's behavior, since many factors are involved. I'd like to read R GBM's code and verify that both sides' designs of the split criterion are consistent. Is that OK?
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

Many thanks, @srowen
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

I scanned the split criteria of sklearn and xgboost.

1. sklearn enumerates all continuous values and splits at the mean value.

   commit 5147fd09c6a063188efde444f47bd006fa5f95f0, sklearn/tree/_splitter.pyx:484:

   ```python
   current.threshold = (Xf[p - 1] + Xf[p]) / 2.0
   ```

2. xgboost: commit 49bdb5c97fccd81b1fdf032eab4599a065c6c4f6

   + If all continuous values are used as candidates, it uses the mean value.

     src/tree/updater_colmaker.cc:555:

     ```c++
     e.best.Update(loss_chg, fid, (fvalue + e.last_fvalue) * 0.5f, d_step == -1);
     ```

   + If continuous features are quantized, it uses `cut`. I'm not familiar with C++ and updater_histmaker.cc is a little complicated, so I don't know what `cut` actually is. However, I guess it is the same as Spark's current split criterion.

     src/tree/updater_histmaker.cc:194:

     ```c++
     if (best->Update(static_cast(loss_chg), fid, hist.cut[i], false)) {
     ```

Anyway, in my opinion the weighted mean is more reasonable than the mean or cut value. The PR is a trivial enhancement for the tree module, and it's not worth spending much time on it because the conclusion is obvious. Still, we will be more confident if more feedback from experts is collected.
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503

@srowen Hi, could you review the PR? The PR is simple, though a lot of unit test code is added. Thanks.
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556

Hi, I have checked R GBM's code and found that R's gbm uses the mean value $(x + y) / 2$ for the split point, not the weighted mean $(c_x * x + c_y * y) / (c_x + c_y)$ described in [JIRA SPARK-16957](https://issues.apache.org/jira/browse/SPARK-16957).

1. code snippet: [gbm-developers/gbm](https://github.com/gbm-developers/gbm), commit a1defa382a629f8b97bf9f552dcd821ee7ac9dac, src/node_search.cpp:145:

   ```c++
   else if(cCurrentVarClasses == 0)   // variable is continuous
   {
       // Evaluate the current split
       dCurrentSplitValue = 0.5*(dLastXValue + dX);
   }
   ```

2. test

   To verify this, I created a toy dataset and ran a test in R.

   ```R
   > f = c(0.0, 0.0, 1.0, 1.0, 1.0, 1.0)
   > l = c(0, 0, 1, 1, 1, 1)
   > df = data.frame(l, f)
   > sapply(df, class)
           l         f
   "numeric" "numeric"
   > mod <- gbm(l~f, data=df, n.trees=1, bag.fraction=1, n.minobsinnode=1, distribution = "bernoulli")
   > pretty.gbm.tree(mod)
     SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
   0        0      5.00e-01        1         2           3           1.33      6  1.480297e-19
   1       -1     -3.00e-03       -1        -1          -1           0.00      2     -3.00e-03
   2       -1      1.50e-03       -1        -1          -1           0.00      4      1.50e-03
   3       -1  1.480297e-19       -1        -1          -1           0.00      6  1.480297e-19
   ```

   As expected, the root's split point is 5.00e-01, namely the mean value `0.5 = (0 + 1) / 2`, not the weighted mean `(0 * 2 + 1 * 4) / 6 ≈ 0.67`.

3. conclusion

   I prefer using the weighted mean for split points in this PR, rather than the mean value as in R's gbm package. How about you? @sethah @srowen
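The same comparison can be reproduced in a few lines of Python (not part of the PR) on the toy feature above, contrasting the plain midpoint that R's gbm reports with the weighted mean the PR proposes:

```python
# Plain midpoint vs count-weighted midpoint on the toy data f = [0,0,1,1,1,1].
from collections import Counter

f = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
# two distinct values with their counts: [(0.0, 2), (1.0, 4)]
(x, cx), (y, cy) = sorted(Counter(f).items())

midpoint = (x + y) / 2.0                  # 0.5, the gbm/sklearn convention
weighted = (cx * x + cy * y) / (cx + cy)  # 4/6, the weighted convention
```

The gap between the two grows with the imbalance of the adjacent counts, which is exactly why the toy dataset above was constructed with two 0s against four 1s.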
[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r105576961

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala (new file, @@ -0,0 +1,155 @@) ---

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.ml.tree.impurity

import org.apache.spark.annotation.{DeveloperApi, Since}
import org.apache.spark.mllib.tree.impurity._

/**
 * [[ApproxBernoulliImpurity]] currently uses variance as a (proxy) impurity measure
 * during tree construction. The main purpose of the class is to have an alternative
 * leaf prediction calculation.
 *
 * Only data with examples each of weight 1.0 is supported.
 *
 * Class for calculating variance during regression.
 */
@Since("2.1")
private[spark] object ApproxBernoulliImpurity extends Impurity {

  /**
   * :: DeveloperApi ::
   * information calculation for multiclass classification
   * @param counts Array[Double] with counts for each label
   * @param totalCount sum of counts for all labels
   * @return information value, or 0 if totalCount = 0
   */
  @Since("2.1")
  @DeveloperApi
  override def calculate(counts: Array[Double], totalCount: Double): Double =
    throw new UnsupportedOperationException("ApproxBernoulliImpurity.calculate")

  /**
   * :: DeveloperApi ::
   * variance calculation
   * @param count number of instances
   * @param sum sum of labels
   * @param sumSquares summation of squares of the labels
   * @return information value, or 0 if count = 0
   */
  @Since("2.1")
  @DeveloperApi
  override def calculate(count: Double, sum: Double, sumSquares: Double): Double = {
    Variance.calculate(count, sum, sumSquares)
  }
}

/**
 * Class for updating views of a vector of sufficient statistics,
 * in order to compute impurity from a sample.
 * Note: Instances of this class do not hold the data; they operate on views of the data.
 */
private[spark] class ApproxBernoulliAggregator
  extends ImpurityAggregator(statsSize = 4) with Serializable {

  /**
   * Update stats for one (node, feature, bin) with the given label.
   * @param allStats Flat stats array, with stats for this (node, feature, bin) contiguous.
   * @param offset index of stats for this (node, feature, bin).
   */
  def update(allStats: Array[Double], offset: Int, label: Double, instanceWeight: Double): Unit = {
    allStats(offset) += instanceWeight
    allStats(offset + 1) += instanceWeight * label
    allStats(offset + 2) += instanceWeight * label * label
    allStats(offset + 3) += instanceWeight * Math.abs(label)
  }

  /**
   * Get an [[ImpurityCalculator]] for a (node, feature, bin).
   * @param allStats Flat stats array, with stats for this (node, feature, bin) contiguous.
   * @param offset index of stats for this (node, feature, bin).
   */
  def getCalculator(allStats: Array[Double], offset: Int): ApproxBernoulliCalculator = {
    new ApproxBernoulliCalculator(allStats.view(offset, offset + statsSize).toArray)
  }
}

/**
 * Stores statistics for one (node, feature, bin) for calculating impurity.
 * Unlike [[ImpurityAggregator]], this class stores its own data and is for a specific
 * (node, feature, bin).
 * @param stats Array of sufficient statistics for a (node, feature, bin).
 */
private[spark] class ApproxBernoulliCalculator(stats: Array[Double])
  extends ImpurityCalculator(stats) {

  require(stats.length == 4,
    s"ApproxBernoulliCalculator requires sufficient statistics array stats to be of length 4," +
      s" but was given array of length ${stats.length}.")
```
[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/14547#discussion_r105814881

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala ---
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/17556

[SPARK-16957][MLlib] Use weighted midpoints for split values.

## What changes were proposed in this pull request?

Use weighted midpoints for split values.

## How was this patch tested?

+ [x] add unit test.
+ [x] modify Split's unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/decision_tree_overflow_and_precision_in_aggregation

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17556.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17556

commit 45b74930eea787411855fc35a7ad7198b35d577e
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-07T04:02:13Z

    TST: add test case

commit c49d3ae7db0e66855b0c896375b11bf51d9ac482
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-07T04:05:36Z

    ENH: use weighted midpoints

commit 387eb498054289149706ecd2f88593d008fd074f
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-07T04:13:44Z

    BUG: constant feature, outOfIndex

commit 2e68f1efca59772d1e905474c2392ad0d8b413c8
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-07T04:15:09Z

    TST: modify split's test case

commit 6a5806f35185596ffda2c88c4879ecaf0be3bda1
Author: 颜发才（Yan Facai） <facai@gmail.com>
Date: 2017-04-07T04:24:02Z

    CLN: move test case
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

Thanks, @yanboliang. Could you give a hand, @srowen?
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

There are test failures in pyspark.ml.tests with Python 2.6, but I don't have that environment.
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764

@yanboliang Thanks, Yanbo. I am not familiar with Python 2.6, which is quite outdated.
[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736

Sure, @yanboliang. Thanks for your suggestion. I'll work on it later, perhaps next week. Is that OK?
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132618802

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---

```scala
@@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)

   /** @group setParam */
   @Since("1.2.0")
-  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
+  def setNumFeatures(value: Int): this.type = {
+    hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
```

--- End diff --

@WeichenXu123 How about adding a `setNumFeatures` method to mllib?
[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18736

@yanboliang Hi, Yanbo. Could you help review the PR? Thanks.
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r126863072

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---

```scala
@@ -317,7 +318,12 @@ final class OneVsRest @Since("1.4.0") (
     val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity)
     instr.logNumClasses(numClasses)

-    val multiclassLabeled = dataset.select($(labelCol), $(featuresCol))
+    val multiclassLabeled = getClassifier match {
+      // SPARK-21306: cache weightCol if necessary
+      case c: HasWeightCol if c.isDefined(c.weightCol) && c.getWeightCol.nonEmpty =>
+        dataset.select($(labelCol), $(featuresCol), c.getWeightCol)
+      case _ => dataset.select($(labelCol), $(featuresCol))
+    }
```

--- End diff --

Hi, @yanboliang. As @MLnick said, not all classifiers inherit HasWeightCol, so it might cause confusion. In my opinion, `setWeightCol` is an attribute owned by one specific classifier, like `setProbabilityCol` and `setRawPredictionCol` for Logistic Regression. So I'd suggest that users configure the classifier itself, rather than OneVsRest. Is that OK?
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126646511

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---

```scala
@@ -36,7 +36,8 @@ import org.apache.spark.util.collection.OpenHashMap

 /**
  * Base trait for [[StringIndexer]] and [[StringIndexerModel]].
  */
-private[feature] trait StringIndexerBase extends Params with HasInputCol with HasOutputCol {
+private[feature] trait StringIndexerBase extends Params with HasHandleInvalid with HasInputCol
```

--- End diff --

Perhaps it is better to break the line at `extends`.
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126645714

--- Diff: python/pyspark/ml/feature.py ---

```python
@@ -3058,26 +3035,37 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
                           "RFormula drops the same category as R when encoding strings.",
                           typeConverter=TypeConverters.toString)

+    handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid entries. " +
```

--- End diff --

If I understand correctly, `handleInvalid` has already been declared in the `HasHandleInvalid` interface, right?
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126643882 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -460,16 +460,16 @@ object LinearRegression extends DefaultParamsReadable[LinearRegression] { val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = WeightedLeastSquares.MAX_NUM_FEATURES /** String name for "auto". */ - private[regression] val AUTO = "auto" + private[regression] val Auto = "auto" --- End diff -- Is `solver` related to `handleInvalid`? This change seems a little confusing.
[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18582#discussion_r126642928 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -36,7 +36,8 @@ import org.apache.spark.sql.types.{DoubleType, StructField, StructType} */ @Since("1.4.0") final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String) - extends Model[Bucketizer] with HasInputCol with HasOutputCol with DefaultParamsWritable { + extends Model[Bucketizer] with HasHandleInvalid with HasInputCol with HasOutputCol +with DefaultParamsWritable { --- End diff -- How about aligning `with` with `extends`?
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 @srowen @yanboliang Could you help review the PR? Thanks.
[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126026388 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour files: Seq[FileStatus]): Option[StructType] = { val libSVMOptions = new LibSVMOptions(options) val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { - // Infers number of features if the user doesn't specify (a valid) one. - val dataFiles = files.filterNot(_.getPath.getName startsWith "_") --- End diff -- Do you mean that the `_SUCCESS` file on HDFS, if it exists, will be safely ignored by Spark?
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 @lins05 thanks, reasonable suggestion, I will fix it later.
[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126023986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour files: Seq[FileStatus]): Option[StructType] = { val libSVMOptions = new LibSVMOptions(options) val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { - // Infers number of features if the user doesn't specify (a valid) one. - val dataFiles = files.filterNot(_.getPath.getName startsWith "_") - val path = if (dataFiles.length == 1) { -dataFiles.head.getPath.toUri.toString - } else if (dataFiles.isEmpty) { + if (files.isEmpty) { throw new IOException("No input path specified for libsvm data") - } else { -throw new IOException("Multiple input paths are not supported for libsvm data.") + } else if (files.length > 1) { +logWarning( + "Multiple input paths detected, determining the number of features by going " + --- End diff -- Good. How about warning only if `numFeatures` is missing? I mean, remove the `files.length > 1` condition:

```scala
if (files.isEmpty) {
} else {
  ... logWarning ...
}
```
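For intuition, the inference path under discussion — scan the data for the maximum feature index only when `numFeatures` is absent, and warn whenever that scan happens — might look like the sketch below. This is a simplified stand-in, not Spark's `LibSVMFileFormat`; the function name and the `files` dict (file name to list of libsvm lines) are hypothetical.

```python
import logging


def infer_num_features(files, num_features_option=None):
    """Infer the libsvm feature count only when the user did not supply one.

    files: dict mapping file name -> list of libsvm lines
           ("label idx:val idx:val ...", 1-based indices).
    Mirrors the logic discussed above: warn whenever we must scan the
    data, regardless of how many input paths there are.
    """
    if num_features_option is not None:
        return num_features_option
    if not files:
        raise IOError("No input path specified for libsvm data")
    logging.warning(
        "numFeatures not specified; determining it by scanning all input paths")
    max_index = 0
    for lines in files.values():
        for line in lines:
            # Skip the label token; each remaining token is "index:value".
            for token in line.split()[1:]:
                max_index = max(max_index, int(token.split(":")[0]))
    return max_index
```

With an explicit option the data is never touched, so no warning is emitted.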
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127873828 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") ( val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) val bcFeaturesStd = instances.context.broadcast(featuresStd) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), - $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial, +val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept), + multinomial = isMultinomial)(_) +val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) { + featuresStd(j / numCoefficientSets) +} else { + 0.0 +} + +val regularization = if (regParamL2 != 0.0) { + val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets --- End diff -- It seems that the `regularization` includes the `intercept`, right? However, the comment in [LogisticRegression.scala: 1903L](https://github.com/apache/spark/pull/18305/files#diff-3734f1689cb8a80b07974eb93de0795dL1903) says: > // We do not apply regularization to the intercepts
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127874833 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala --- @@ -32,40 +34,45 @@ private[ml] trait DifferentiableRegularization[T] extends DiffFunction[T] { } /** - * A Breeze diff function for computing the L2 regularized loss and gradient of an array of + * A Breeze diff function for computing the L2 regularized loss and gradient of a vector of * coefficients. * * @param regParam The magnitude of the regularization. * @param shouldApply A function (Int => Boolean) indicating whether a given index should have *regularization applied to it. - * @param featuresStd Option indicating whether the regularization should be scaled by the standard - *deviation of the features. + * @param applyFeaturesStd Option for a function which maps coefficient index (column major) to the + * feature standard deviation. If `None`, no standardization is applied. 
*/ private[ml] class L2Regularization( -val regParam: Double, +override val regParam: Double, shouldApply: Int => Boolean, -featuresStd: Option[Array[Double]]) extends DifferentiableRegularization[Array[Double]] { +applyFeaturesStd: Option[Int => Double]) extends DifferentiableRegularization[Vector] { - override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = { -var sum = 0.0 -val gradient = new Array[Double](coefficients.length) -coefficients.indices.filter(shouldApply).foreach { j => - val coef = coefficients(j) - featuresStd match { -case Some(stds) => - val std = stds(j) - if (std != 0.0) { -val temp = coef / (std * std) -sum += coef * temp -gradient(j) = regParam * temp - } else { -0.0 + override def calculate(coefficients: Vector): (Double, Vector) = { +coefficients match { + case dv: DenseVector => +var sum = 0.0 +val gradient = new Array[Double](dv.size) +dv.values.indices.filter(shouldApply).foreach { j => + val coef = coefficients(j) + applyFeaturesStd match { +case Some(getStd) => + val std = getStd(j) + if (std != 0.0) { +val temp = coef / (std * std) +sum += coef * temp +gradient(j) = regParam * temp + } else { +0.0 + } +case None => + sum += coef * coef + gradient(j) = coef * regParam --- End diff -- Trivial, to match `regParam * temp` above, how about using `regParam * coef`?
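The L2 penalty with optional standardization discussed in this diff fits in a few lines. Below is a minimal NumPy sketch of the same logic — loss `0.5 * regParam * sum`, gradient `regParam * coef` scaled by `1 / std^2` where a standard deviation is supplied — not Spark's actual `L2Regularization` class; the function name and signature are mine.

```python
import numpy as np


def l2_regularization(coef, reg_param, should_apply, get_std=None):
    """L2 penalty and gradient over a coefficient vector.

    should_apply: index -> bool, e.g. excludes intercept positions.
    get_std: optional index -> feature standard deviation; when given,
             each term is scaled by 1 / std**2 and zero-variance
             features are skipped entirely.
    Returns (loss, gradient).
    """
    loss_sum = 0.0
    grad = np.zeros_like(coef, dtype=float)
    for j, c in enumerate(coef):
        if not should_apply(j):
            continue  # e.g. do not regularize the intercept
        if get_std is not None:
            std = get_std(j)
            if std != 0.0:
                temp = c / (std * std)
                loss_sum += c * temp
                grad[j] = reg_param * temp
        else:
            loss_sum += c * c
            grad[j] = reg_param * c
    return 0.5 * reg_param * loss_sum, grad
```

With `should_apply = lambda j: j < num_features`, the trailing intercept coefficients receive neither loss nor gradient contributions, matching the "we do not apply regularization to the intercepts" comment quoted earlier.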
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127972263 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") ( val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) val bcFeaturesStd = instances.context.broadcast(featuresStd) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), - $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial, +val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept), + multinomial = isMultinomial)(_) +val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) { + featuresStd(j / numCoefficientSets) +} else { + 0.0 +} + +val regularization = if (regParamL2 != 0.0) { + val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets --- End diff -- Thanks for clarifying.
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r128158473 --- Diff: python/pyspark/ml/tests.py --- @@ -1255,6 +1255,17 @@ def test_output_columns(self): output = model.transform(df) self.assertEqual(output.columns, ["label", "features", "prediction"]) +def test_support_for_weightCol(self): --- End diff -- Sure. Use DecisionTreeClassifier to test.
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r129562237 --- Diff: python/pyspark/ml/tests.py --- @@ -1255,6 +1255,24 @@ def test_output_columns(self): output = model.transform(df) self.assertEqual(output.columns, ["label", "features", "prediction"]) +def test_support_for_weightCol(self): +df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8), 1.0), + (1.0, Vectors.sparse(2, [], []), 1.0), + (2.0, Vectors.dense(0.5, 0.5), 1.0)], +["label", "features", "weight"]) +# classifier inherits hasWeightCol +lr = LogisticRegression(maxIter=5, regParam=0.01) +ovr = OneVsRest(classifier=lr, weightCol="weight") +self.assertIsNotNone(ovr.fit(df)) +ovr2 = OneVsRest(classifier=lr).setWeightCol("weight") --- End diff -- cleaned.
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18554#discussion_r129562189 --- Diff: python/pyspark/ml/classification.py --- @@ -1517,20 +1517,22 @@ class OneVsRest(Estimator, OneVsRestParams, MLReadable, MLWritable): @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", - classifier=None): + weightCol=None, classifier=None): --- End diff -- moved.
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should support setWeightCol
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 ping @holdenk @yanboliang
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18736 [SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF ## What changes were proposed in this pull request? Add an indexOf method for ml.feature.HashingTF. The PR is a hotfix that stores the underlying hashingTF so its indexOf can be reused. I believe it would be better to migrate the native implementation from mllib to ml. I can work on that, but I don't know whether or not to open a new JIRA for the migration. ## How was this patch tested? + [x] add a test case. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/add_indexOf_for_ml Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18736.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18736 commit f08861d64390f0eb4d7e34de13d78db67e046457 Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-26T05:45:37Z hotfix: store hashingTF to reuse indexOf commit 174a2d57ebeccba10b8f1435970cde69f5829c79 Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-26T06:14:37Z TST: add test case
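For intuition, the hashing trick behind an `indexOf` method can be sketched in plain Python. Note the hash function below (CRC32) is only a stand-in: Spark's ml `HashingTF` uses MurmurHash3 with a fixed seed, so the indices here will not match Spark's, and both helper names are mine.

```python
import zlib


def index_of(term, num_features=1 << 18):
    """Map a term to a column index via the hashing trick.

    Illustrative only: CRC32 stands in for Spark's seeded MurmurHash3.
    """
    return zlib.crc32(term.encode("utf-8")) % num_features


def term_frequencies(doc, num_features=1 << 18):
    """Build the sparse index -> count map a HashingTF transform produces."""
    tf = {}
    for term in doc:
        idx = index_of(term, num_features)
        tf[idx] = tf.get(idx, 0) + 1
    return tf
```

The point of exposing `index_of` is exactly the use case in the PR: given a fitted term-frequency vector, you can look up which column a specific word landed in.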
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18523 [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type ## What changes were proposed in this pull request? Add the column name to the exception raised for an unsupported data type. ## How was this patch tested? [ ] pass all tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/vectorassembler_add_col Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18523.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18523 commit 95dbf6c7b287d0010af9de377ff6b93dec760808 Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-04T05:42:07Z ENH: report the name of missing column
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 @jkbradley Do you have time to review the PR? I believe it will be a small improvement for prediction. Thanks.
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125398010 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) +val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType)) inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") + case (_, _: NumericType | BooleanType) => + case (_, t) if t.isInstanceOf[VectorUDT] => --- End diff -- Good suggestion, fixed.
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 I don't know how to write a unit test for this PR. Is one necessary?
[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17383
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 Good idea!
[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...
GitHub user facaiy reopened a pull request: https://github.com/apache/spark/pull/17383 [SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data ## What changes were proposed in this pull request? DecisionTree should take advantage of sparse feature vectors. Aggregation over training data could handle the empty/zero-valued data elements more efficiently. ## How was this patch tested? Modifying the inner implementation won't change the behavior of the DecisionTree module, hence all existing unit tests should pass. Some performance benchmarks are perhaps needed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17383.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17383 commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-22T05:45:58Z CLN: use Vector to implement binnedFeatures in TreePoint commit 9ce6b813beffb9d58e7b2907425a1262610256be Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-22T09:15:30Z BUG: fix for incompatible argument of predictImpl method commit 37f05f9b0386acc8bea048e72aff2b9c37ca4ca6 Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-22T09:18:04Z CLN: create sparse vector when converting to TreePoint commit c9664ce6c94b98cbc76253817e637d9a968e4bd6 Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-22T09:21:59Z CLN: change Array to Vector in TreePoint when created commit d6ef9e512ea4a58db2dccf3e7cca95f9e8b0df8f Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-23T02:12:22Z PREP: use Vector[Int] to store binnedFeature commit 59eb779a9d4f711e7b28d31d579cc49e3d3cc370 Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-23T03:50:14Z CLN: change binnedFeatures from def to val commit
9cbe577b408e987f3026d01316f5a7f2d4c5cfb2 Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-28T00:57:42Z CLN: use filter to select non-zero bits commit b5b0dc8683b6e2d7d274aa8d39932dec61e6193d Author: 颜发才（Yan Facai） <facai@gmail.com> Date: 2017-03-28T01:03:55Z BUG: fix, compile fails commit cf7e3d8e03f73df725336d0d5a9dd6cc16e7bf95 Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-05T05:42:09Z Merge branch 'master' into ENH/use_sparsity_in_decision_tree commit 032d50d8c8a851671ba2754cec817d0f6e9ae70f Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-05T06:20:38Z CLN: use BSV in predictImpl commit 257ddf773eb47499962d6cc57fd1323324dd4ab8 Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-05T06:42:24Z ENH: create subclass TreeSparsePoint commit 8a919735f9474283d263df78feb2e176f66917f3 Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-05T06:58:54Z ENH: use TreeDensePoint when numFeatures < 1
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125584572 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) +val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType)) inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") + case (_, _: NumericType | BooleanType) => + case (_, t) if t.isInstanceOf[VectorUDT] => --- End diff -- Hey, it throws a "not found" exception if `t: VectorUDT` is used. Do you have any idea why?
[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18554 [SPARK-21306][ML] OneVsRest should cache weightCol if necessary ## What changes were proposed in this pull request? Cache weightCol if the classifier inherits the HasWeightCol trait. ## How was this patch tested? + [x] add a unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark BUG/oneVsRest_missing_weightCol Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18554.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18554 commit f1785959e04e462877fd74e6c767e47e0fb207d6 Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-06T09:38:01Z BUG: cache weightCol if necessary commit a3e6e967a23f06dbd50aee08c34b07055c5646ee Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-06T09:55:12Z TST: add unit test
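The one-vs-rest reduction being patched here trains one binary model per class and should forward the instance weights to each of them. A minimal sketch — the function and its `fit_binary` callback are hypothetical stand-ins, not Spark's OneVsRest API:

```python
def fit_one_vs_rest(fit_binary, X, y, num_classes, sample_weight=None):
    """Train num_classes binary models, one per class.

    fit_binary(X, binary_labels, sample_weight) -> model is supplied by
    the caller. The instance weights are propagated to every binary
    problem, mirroring how OneVsRest should forward weightCol to its
    base classifier when that classifier supports weights.
    """
    models = []
    for k in range(num_classes):
        # Relabel: class k becomes the positive class, everything else negative.
        binary_labels = [1.0 if label == k else 0.0 for label in y]
        models.append(fit_binary(X, binary_labels, sample_weight))
    return models
```

The bug this PR fixes is precisely the case where `sample_weight` exists but silently never reaches `fit_binary`.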
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125860650 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) -inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") +val incorrectColumns = inputColNames.flatMap { name => + schema(name).dataType match { +case _: NumericType | BooleanType => None +case t if t.isInstanceOf[VectorUDT] => None +case other => Some(s"Data type $other of column $name is not supported.") + } +} +if (!incorrectColumns.isEmpty) { --- End diff -- Yes, modified.
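The error-reporting pattern in this diff — collect every offending column, then fail once with all names — can be sketched in Python. Here `schema` is a hypothetical dict of column name to Python type standing in for Spark's `StructType`, with `list` standing in for the vector type; the function name is mine.

```python
# Stand-ins: int/float/bool play the role of NumericType | BooleanType,
# and list plays the role of VectorUDT.
NUMERIC_TYPES = (int, float, bool)


def check_input_columns(schema):
    """Validate column types, reporting every bad column in one exception.

    Collecting all failures before raising (rather than stopping at the
    first) tells the user every column they must fix at once.
    """
    bad = [
        f"Data type {dtype.__name__} of column {name} is not supported."
        for name, dtype in schema.items()
        if not (issubclass(dtype, NUMERIC_TYPES) or dtype is list)
    ]
    if bad:
        raise ValueError(" ".join(bad))
```

This mirrors the `flatMap` over `inputColNames` above: supported types map to `None`, unsupported ones to a message carrying the column name.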
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 @SparkQA Jenkins, run tests again, please.
[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18523 @SparkQA test again, please.
[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18554 I'm not familiar with R, and I used grep to search for "OneVsRest" and found nothing. Hence it seems that nothing needs to be done for the R part.
[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18556#discussion_r126050849 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour files: Seq[FileStatus]): Option[StructType] = { val libSVMOptions = new LibSVMOptions(options) val numFeatures: Int = libSVMOptions.numFeatures.getOrElse { - // Infers number of features if the user doesn't specify (a valid) one. - val dataFiles = files.filterNot(_.getPath.getName startsWith "_") - val path = if (dataFiles.length == 1) { -dataFiles.head.getPath.toUri.toString - } else if (dataFiles.isEmpty) { + if (files.isEmpty) { throw new IOException("No input path specified for libsvm data") --- End diff -- In my opinion, it is safe / necessary to check whether the parameter is valid in advance. Perhaps `IllegalArgumentException` is more suitable.
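The early-validation point can be illustrated in plain Python: reject an empty input list up front with an argument error (Scala's `IllegalArgumentException`; `ValueError` here), rather than letting a lower layer surface a less precise I/O error later. The function name and placeholder body below are illustrative only, not Spark's implementation.

```python
# Sketch of validating an argument before doing any work with it.
def infer_num_features(files):
    if not files:
        # fail fast with an argument error, not an I/O error from deeper code
        raise ValueError("No input path specified for libsvm data")
    # ... real code would scan a data file here to infer the feature count ...
    return len(files)  # placeholder for the real inference

try:
    infer_num_features([])
except ValueError as err:
    print(err)
```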
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125763918 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) -inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") +val incorrectColumns = inputColNames.flatMap { name => --- End diff -- Hi, Yanbo. Using `flatMap` here is aimed at unpacking the values from the Options. Is it OK?
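The flatMap-over-Option pattern in the diff can be sketched in plain Python: each column maps to either `None` (supported type) or an error message, and flattening discards the `None`s so only the failures remain. The type names and schema below are illustrative, not Spark's.

```python
# Sketch of collecting per-column error messages and dropping the "empty"
# results, the Python analogue of Scala's flatMap over Option.
SUPPORTED = {"int", "double", "boolean", "vector"}

def check_column(name, dtype):
    """Return None for a supported type, else an error message (an 'Option')."""
    if dtype in SUPPORTED:
        return None
    return f"Data type {dtype} of column {name} is not supported."

schema = {"age": "int", "tags": "string", "scores": "vector"}
incorrect = [msg for name, dtype in schema.items()
             if (msg := check_column(name, dtype)) is not None]
print(incorrect)  # only the unsupported column(s) are reported
```

The benefit over the old `foreach`-and-throw version is that all offending columns are reported in one pass instead of failing on the first.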
[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18523#discussion_r125539040 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String) override def transformSchema(schema: StructType): StructType = { val inputColNames = $(inputCols) val outputColName = $(outputCol) -val inputDataTypes = inputColNames.map(name => schema(name).dataType) +val inputDataTypes = inputColNames.map(name => (name, schema(name).dataType)) inputDataTypes.foreach { - case _: NumericType | BooleanType => - case t if t.isInstanceOf[VectorUDT] => - case other => -throw new IllegalArgumentException(s"Data type $other is not supported.") + case (_, _: NumericType | BooleanType) => + case (_, _: VectorUDT) => + case (name, other) => +throw new IllegalArgumentException(s"Data type $other of column $name is not supported.") --- End diff -- Fine, modified.
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 @srowen I am not sure whether I understand your question correctly. RandomForest uses LearningNode to construct the tree model during training, and converts them to LeafNode or InternalNode at the end. Hence, all nodes are the same type and can be merged during training. However, if two children of a node output the same prediction, should the node itself then be collapsed along with its children? I don't know.
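The cascading question above can be sketched with a toy recursive prune in plain Python (not Spark's node classes): when both children of a node are leaves with the same prediction, the node is replaced by a single leaf, and the merge naturally propagates upward through the recursion.

```python
# Toy sketch of a "reducible tree" prune: merge sibling leaves that agree.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    prediction: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self):
        return self.left is None and self.right is None

def prune(node):
    """Bottom-up: prune children first, then merge them if they agree."""
    if node is None or node.is_leaf:
        return node
    node.left, node.right = prune(node.left), prune(node.right)
    if (node.left.is_leaf and node.right.is_leaf
            and node.left.prediction == node.right.prediction):
        return Node(prediction=node.left.prediction)  # collapse to one leaf
    return node

# A root whose left subtree reduces to a leaf predicting 0.0, after which
# the root's two children agree and the whole tree collapses.
tree = Node(0.0, Node(1.0, Node(0.0), Node(0.0)), Node(0.0))
pruned = prune(tree)
```

Because the prune is bottom-up, a parent does "keep step with" its children: once the children merge, the parent becomes eligible for merging on the next level up.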
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 Fixed the failing case; please retest it.
[GitHub] spark pull request #17503: [SPARK-3159][MLlib] Check for reducible DecisionT...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17503#discussion_r113360409 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala --- @@ -61,6 +61,8 @@ import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity, Variance} * @param subsamplingRate Fraction of the training data used for learning decision tree. * @param useNodeIdCache If this is true, instead of passing trees to executors, the algorithm will * maintain a separate RDD of node Id cache for each row. + * @param canMergeChildren Merge pairs of leaf nodes of the same parent which --- End diff -- A new parameter was added to the Strategy class, which fails the Mima tests. How should I deal with it? ```bash [error] * synthetic method $default$13()Int in object org.apache.spark.mllib.tree.configuration.Strategy has a different result type in current version, where it is Boolean rather than Int ``` [see failed logs](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3675/consoleFull)
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 I have the same question as you. I guess that the impurity info is useful for debugging and analyzing the tree model. However, as the tree is grown from root to leaf during training, it seems needless to merge its children.
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r114043563 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging { // sort distinct values val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray - // if possible splits is not enough or just enough, just return all possible splits + def weightedMean(pre: (Double, Int), cur: (Double, Int)): Double = { +val (preValue, preCount) = pre +val (curValue, curCount) = cur +(preValue * preCount + curValue * curCount) / (preCount.toDouble + curCount) + } + val possibleSplits = valueCounts.length - 1 - if (possibleSplits <= numSplits) { -valueCounts.map(_._1).init + if (possibleSplits == 0) { +// constant feature +Array.empty[Double] + --- End diff -- removed.
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r114043568 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -138,9 +169,10 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { Array(2), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) - val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) + val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) - assert(splits === Array(1.0)) + val expSplits = Array((1.0 * 1 + 2.0 * 15) / (1 + 15)) // = (1.9375) --- End diff -- renamed.
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r114043558 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1037,7 +1051,10 @@ private[spark] object RandomForest extends Logging { // makes the gap between currentCount and targetCount smaller, // previous value is a split threshold. if (previousGap < currentGap) { -splitsBuilder += valueCounts(index - 1)._1 +val pre = valueCounts(index - 1) +val cur = valueCounts(index) + --- End diff -- removed
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 For a (training) sample of a continuous feature, say {x0, x1, x2, x3, ..., x100}, Spark currently selects quantiles as split points. Suppose 10-quantiles are used, x2 is the 1st quantile, and x10 is the 2nd quantile. The assumption is that P(x < x2) ~= P(x2 < x < x10). However, x2 itself is not perfect: as the data is continuous, there exists a point z that actually satisfies P(x < z) == P(z < x < x10). Hence it's reasonable, in my opinion, that the averaged midpoint between x2 and x3 is more appropriate.
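The weighted midpoint can be sketched in plain Python, mirroring the `weightedMean` helper in the diff: each of the two adjacent distinct values contributes in proportion to its count, so the split threshold is pulled toward the value seen more often.

```python
# Sketch of the weighted midpoint between two adjacent distinct feature
# values, each paired with its occurrence count.
def weighted_mean(pre, cur):
    (pre_value, pre_count), (cur_value, cur_count) = pre, cur
    return (pre_value * pre_count + cur_value * cur_count) / (pre_count + cur_count)

# e.g. value 1.0 seen twice and value 2.0 seen eight times:
split = weighted_mean((1.0, 2), (2.0, 8))
print(split)  # -> 1.8
```

This matches the expected splits in the updated test suite, e.g. (1.0 * 2 + 2.0 * 8) / (2 + 8) = 1.8; an unweighted midpoint would instead give 1.5 regardless of the counts.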
[GitHub] spark pull request #17556: [SPARK-16957][MLlib] Use weighted midpoints for s...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/17556#discussion_r114043439 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -112,9 +138,11 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { Array(5), Gini, QuantileStrategy.Sort, 0, 0, 0.0, 0, 0 ) - val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) + val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3).map(_.toDouble) val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) - assert(splits === Array(1.0, 2.0)) + val expSplits = Array((1.0 * 2 + 2.0 * 8) / (2 + 8), +(2.0 * 8 + 3.0 * 2) / (8 + 2)) // = (1.8, 2.2) --- End diff -- It may be helpful for code reviewers/maintainers to get the idea clearly.
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 By the way, it's safe to use the mean value, as it matches the other libraries. If requested, I'd be happy to modify the PR.
[GitHub] spark issue #17556: [SPARK-16957][MLlib] Use weighted midpoints for split va...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17556 OK, the weight has been removed from the calculation.
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r131529693 --- Diff: python/pyspark/ml/classification.py --- @@ -1344,7 +1346,19 @@ def _fit(self, dataset): numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1 -multiclassLabeled = dataset.select(labelCol, featuresCol) +weightCol = None +if (self.isDefined(self.weightCol) and self.getWeightCol()): +if isinstance(classifier, HasWeightCol): +weightCol = self.getWeightCol() +else: +warnings.warn("weightCol is ignored, " + "as it is not supported by {} now.".format( --- End diff -- Thank you very much for the help, @yanboliang!
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r131529768 --- Diff: python/pyspark/ml/classification.py --- @@ -1423,7 +1425,18 @@ def _fit(self, dataset): numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1 -multiclassLabeled = dataset.select(labelCol, featuresCol) +weightCol = None +if (self.isDefined(self.weightCol) and self.getWeightCol()): +if isinstance(classifier, HasWeightCol): +weightCol = self.getWeightCol() +else: +warnings.warn("weightCol is ignored, " + "as it is not supported by {} now.".format(classifier)) --- End diff -- Modified as suggested, thanks very much!
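The warn-and-ignore logic in the diff can be sketched without Spark: if the user set weightCol but the wrapped classifier cannot use weights, emit a warning and drop the setting instead of failing. The class names below are stand-ins for pyspark's, not the real API.

```python
# Simplified sketch of dropping an unsupported weightCol with a warning.
import warnings

class HasWeightCol:
    """Stand-in for the HasWeightCol mixin checked with isinstance."""

class WeightedClf(HasWeightCol):
    """Stand-in for a classifier that accepts instance weights."""

class UnweightedClf:
    """Stand-in for a classifier that does not."""

def resolve_weight_col(classifier, weight_col):
    if weight_col and not isinstance(classifier, HasWeightCol):
        warnings.warn("weightCol is ignored, as it is not supported by "
                      f"{type(classifier).__name__} now.")
        return None
    return weight_col

print(resolve_weight_col(WeightedClf(), "weight"))    # column is kept
print(resolve_weight_col(UnweightedClf(), "weight"))  # warns, column dropped
```

Warning rather than raising keeps OneVsRest usable with any classifier while still telling the user why their weights had no effect.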
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18764 [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.0. ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] add a unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark BUG/branch-2.0_OneVsRest_support_setWeightCol Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18764.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18764 commit 0c60c28ca6a3da84401f4947a21ace6980be08fa Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-28T02:10:35Z [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol
[GitHub] spark pull request #18763: [SPARK-21306][ML] OneVsRest should support setWei...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18763 [SPARK-21306][ML] OneVsRest should support setWeightCol for branch-2.1 The PR is related to #18554, and is modified for branch 2.1. ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] add a unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark BUG/branch-2.1_OneVsRest_support_setWeightCol Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18763.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18763 commit ddb83da765f38546a96a63b26478f594601c7e2b Author: Yan Facai (颜发才) <facai@gmail.com> Date: 2017-07-28T02:10:35Z [SPARK-21306][ML] OneVsRest should support setWeightCol
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130202540 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau } test("SPARK-21306: OneVsRest should support setWeightCol") { -val dataset2 = dataset.withColumn("weight", lit(1)) +val dataset2 = dataset.withColumn("weight", lit(1.0)) --- End diff -- @yanboliang @srowen My mistake. It would be better to use the double value 1.0 in master / branch-2.2 / branch-2.3 in the same way.
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r130200288 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -33,6 +33,7 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.mllib.util.TestingUtils._ import org.apache.spark.rdd.RDD import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ --- End diff -- add missing import for branch 2.0
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18764#discussion_r130200379 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -143,6 +144,16 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau assert(output.schema.fieldNames.toSet === Set("label", "features", "prediction")) } + test("SPARK-21306: OneVsRest should support setWeightCol") { +val dataset2 = dataset.withColumn("weight", lit(1.0)) --- End diff -- use double value, `lit(1.0)`
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch-2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130200461 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -157,6 +157,16 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau assert(output.schema.fieldNames.toSet === Set("label", "features", "prediction")) } + test("SPARK-21306: OneVsRest should support setWeightCol") { +val dataset2 = dataset.withColumn("weight", lit(1.0)) --- End diff -- use double value, `lit(1.0)`
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18763#discussion_r130213337 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala --- @@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau } test("SPARK-21306: OneVsRest should support setWeightCol") { -val dataset2 = dataset.withColumn("weight", lit(1)) +val dataset2 = dataset.withColumn("weight", lit(1.0)) --- End diff -- OK. Could you test and merge the PR into branch 2.1?
[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18764 Sure, thanks, @yanboliang !
[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18764
[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18763
[GitHub] spark issue #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest should suppo...
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/18763 Thanks, all.