[GitHub] spark issue #19318: [WIP][SPARK-22096][ML] use aggregateByKeyLocally in feat...

2017-10-11 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/19318 thanks :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #18936: [SPARK-21688][ML][MLLIB] make native BLAS the first choi...

2017-09-22 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/18936 Hi Sean, sorry for late reply. Yeah, actually we do have some performance data on F2J vs. OpenBLAS. It seems there is no performance gain from openblas, not even on the unit test level. We

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/19317 Nice catch. thanks. the perf gain is truly narrow. I believe this impl just tried to align with the impl of 'reduceByKeyLocally'. @ConeyLiu maybe we should revisit the code, along
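
For readers unfamiliar with the pattern under discussion: Spark's existing `reduceByKeyLocally` merges values per key and returns the result to the driver as a map rather than as an RDD. The following is a minimal plain-Python sketch of what an aggregate-by-key-locally operation could look like under that model. The function name, its parameters, and the list-of-lists stand-in for partitions are all assumptions made for illustration; this is not the code from the pull request.

```python
def aggregate_by_key_locally(partitions, zero, seq_op, comb_op):
    """Hypothetical stand-in for the idea, not Spark's API.

    partitions: list of lists of (key, value) pairs, one list per partition.
    zero:       callable returning a fresh accumulator.
    seq_op:     folds one value into an accumulator (within a partition).
    comb_op:    merges two accumulators (across partitions).
    """
    merged = {}
    for part in partitions:
        # Step 1: aggregate inside the partition into a local map.
        local = {}
        for key, value in part:
            acc = local[key] if key in local else zero()
            local[key] = seq_op(acc, value)
        # Step 2: merge the partition's map into the single result map.
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged


partitions = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("c", 5)]]
totals = aggregate_by_key_locally(
    partitions,
    zero=lambda: 0,
    seq_op=lambda acc, v: acc + v,
    comb_op=lambda x, y: x + y,
)
print(totals)
```

The result is an ordinary dictionary held in one place, which is the property that distinguishes a "locally" variant from an operation that returns a distributed collection.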

[GitHub] spark pull request #19318: [SPARK-22096][ML] use aggregateByKeyLocally in fe...

2017-09-21 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/19318 [SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calc… ## What changes were proposed in this pull request? NaiveBayes currently takes aggregateByKey followed

[GitHub] spark issue #18936: [SPARK-21688][ML][MLLIB] make native BLAS the first choi...

2017-08-18 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/18936 Okay. We will benchmark on OpenBLAS. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark issue #18936: [SPARK-21688][ML][MLLIB] make native BLAS the first choi...

2017-08-17 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/18936 @srowen currently, what we see is, with default thread setting(take up all computation resource available) for native blas, the No. 1 hot spot (with 95%+ self time

[GitHub] spark issue #18936: [SPARK-21688][ML][MLLIB] make native BLAS the first choi...

2017-08-16 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/18936 thanks, Sean and Nick. To @srowen , I think the difference is the finding from our previous investigation that, thread setting in the native BLAS impacts the overall performance of a method

[GitHub] spark issue #18936: [SPARK-21688][ML][MLLIB] make native BLAS the first choi...

2017-08-14 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/18936 Yes, they are not the only place, but we only tested on the dense dataset and got the performance data shown above. We are conservative on sparse data, so keep the sparse path the way

[GitHub] spark pull request #18936: [SPARK-21688][ML][MLLIB] make native BLAS the fir...

2017-08-14 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/18936 [SPARK-21688][ML][MLLIB] make native BLAS the first choice for BLAS level 1 operations for dense data ## What changes were proposed in this pull request? In this PR, we make native BLAS

[GitHub] spark issue #17894: [WIP][SPARK-17134][ML] Use level 2 BLAS operations in Lo...

2017-06-01 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/17894 @sethah yes, we only take 100 samples and trained with 3 iterations, numClasses is 20 of our test dataset for single node testing. Yeah, I also believe it'd have a better result if it's

[GitHub] spark issue #17894: [WIP][SPARK-17134][ML] Use level 2 BLAS operations in Lo...

2017-06-01 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/17894 Forgot to mention, we observed a nearly 2x performance gain with the help of nativeBLAS- MKL, without a fine tuning, so if we can also make F2J version run faster in distributed cluster than

[GitHub] spark issue #17894: [WIP][SPARK-17134][ML] Use level 2 BLAS operations in Lo...

2017-06-01 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/17894 sorry for late update! we tested on this PR against the current implementation with both dense and sparse(0.95 sparsity): ![image](https://cloud.githubusercontent.com/assets/2673819

[GitHub] spark issue #17894: [WIP][SPARK-17134][ML] Use level 2 BLAS operations in Lo...

2017-05-16 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/17894 @sethah Sorry for the late response. Setting as WIP. We have performance data for dense features, data for the sparse feature will be ready soon. thanks.

[GitHub] spark issue #17894: [SPARK-17134][ML] Use level 2 BLAS operations in Logisti...

2017-05-09 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/17894 @hhbyyh performance testing is ongoing, thanks!

[GitHub] spark pull request #17894: [SPARK-17134][ML] Use level 2 BLAS operations in ...

2017-05-09 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/17894#discussion_r115415823 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1722,25 +1723,22 @@ private class LogisticAggregator

[GitHub] spark pull request #17894: [SPARK-17134][ML] Use level 2 BLAS operations in ...

2017-05-09 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/17894#discussion_r115415580 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -23,6 +23,7 @@ import scala.collection.mutable

[GitHub] spark pull request #17237: [SPARK-19852][PYSPARK][ML] Update Python API setH...

2017-05-08 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/17237#discussion_r115186158 --- Diff: python/pyspark/ml/feature.py --- @@ -1936,6 +1935,14 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid

[GitHub] spark pull request #17894: [SPARK-17134][ML] Use level 2 BLAS operations in ...

2017-05-07 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/17894 [SPARK-17134][ML] Use level 2 BLAS operations in LogisticAggregator ## What changes were proposed in this pull request? Multinomial logistic regression uses LogisticAggregator class

[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-14 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/17237 Sure. No problem!

[GitHub] spark pull request #17237: [SPARK-19852][PYSPARK][ML] Update Python API setH...

2017-03-09 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/17237 [SPARK-19852][PYSPARK][ML] Update Python API setHandleInvalid for StringIndexer ## What changes were proposed in this pull request? This PR is to maintain API parity with changes made

[GitHub] spark issue #16883: [SPARK-17498][ML] StringIndexer enhancement for handling...

2017-03-07 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/16883 Sure, I can work on that :) @jkbradley

[GitHub] spark issue #16883: [SPARK-17498][ML] StringIndexer enhancement for handling...

2017-03-06 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/16883 updated. Thank you both @imatiach-msft @jkbradley

[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...

2017-02-28 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r103599555 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -17,14 +17,16 @@ package org.apache.spark.ml.feature

[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...

2017-02-28 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r103597822 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -163,25 +190,28 @@ class StringIndexerModel

[GitHub] spark issue #16883: [SPARK-17498][ML] StringIndexer enhancement for handling...

2017-02-28 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/16883 gotcha, will update soon.

[GitHub] spark pull request #16922: [SPARK-19590][pyspark][ML] Update the document fo...

2017-02-14 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/16922#discussion_r101183452 --- Diff: python/pyspark/ml/feature.py --- @@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab

[GitHub] spark pull request #16922: [SPARK-19590][pyspark][ML] update the document fo...

2017-02-13 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/16922 [SPARK-19590][pyspark][ML] update the document for QuantileDiscretize… ## What changes were proposed in this pull request? This PR is to document the changes on QuantileDiscretizer

[GitHub] spark issue #16883: [SPARK-17498][ML] StringIndexer enhancement for handling...

2017-02-13 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/16883 @srowen @jkbradley do you have time to take a look?

[GitHub] spark pull request #16883: [SPARK-17498][ML] enchance StringIndexer to handl...

2017-02-09 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/16883 [SPARK-17498][ML] enchance StringIndexer to handle unseen labels ## What changes were proposed in this pull request? This PR is an enhancement to ML StringIndexer. Before this PR
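
As background for this thread, here is a minimal plain-Python sketch of label indexing with a policy for unseen labels. It assumes the behaviour documented for Spark's `StringIndexer`: the most frequent label gets index 0, and a "keep" policy maps unseen labels to one extra index equal to the number of known labels. The function names, the alphabetical tie-break, and the exact option strings are assumptions for illustration, not the pull request's code.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically; that tie-break is an assumption
    made here so the sketch is deterministic.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def transform(labels, mapping, handle_invalid="error"):
    """Index labels; handle_invalid is one of 'error', 'skip', 'keep'."""
    indexed = []
    for label in labels:
        if label in mapping:
            indexed.append(mapping[label])
        elif handle_invalid == "keep":
            # Unseen labels share one extra bucket after the known labels.
            indexed.append(len(mapping))
        elif handle_invalid == "skip":
            continue  # drop the row entirely
        else:
            raise ValueError(f"Unseen label: {label!r}")
    return indexed


mapping = fit_string_indexer(["a", "b", "a", "c", "a", "b"])
kept = transform(["a", "d", "c"], mapping, handle_invalid="keep")
skipped = transform(["a", "d", "c"], mapping, handle_invalid="skip")
print(mapping, kept, skipped)
```

With "error" as the policy, the same input raises `ValueError` on the label "d", which is the situation the enhancement is meant to make avoidable.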

[GitHub] spark pull request #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spa...

2016-11-17 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15055#discussion_r88427044 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala --- @@ -34,6 +34,7 @@ import org.apache.spark.rdd.RDD import

[GitHub] spark pull request #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spa...

2016-11-16 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15055#discussion_r88376643 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala --- @@ -34,6 +34,7 @@ import org.apache.spark.rdd.RDD import

[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...

2016-11-15 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/15055 @srowen @jkbradley do you have time to take a look at this one?

[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add groupKFold to CrossValidator

2016-10-30 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14640 @rdelassus Agree. There are a number of folding methods, so some code refactoring should be done if more folding methods are to be supported in the future. But for now, I guess we will just

[GitHub] spark issue #15428: [SPARK-17219][ML] enhanced NaN value handling in Bucketi...

2016-10-25 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/15428 sorry, I must have forgotten to commit the changes. All done now. Thanks for reviewing.

[GitHub] spark issue #15428: [SPARK-17219][ML] enhanced NaN value handling in Bucketi...

2016-10-19 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/15428 Thanks for your valuable suggestions. @jkbradley @srowen

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-19 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84205467 --- Diff: python/pyspark/ml/feature.py --- @@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-19 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84205458 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -66,11 +67,13 @@ private[feature] trait

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-19 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84205414 --- Diff: docs/ml-features.md --- @@ -1104,9 +1104,11 @@ for more details on the API. `QuantileDiscretizer` takes a column with continuous features

[GitHub] spark issue #15428: [SPARK-17219][ML] enhanced NaN value handling in Bucketi...

2016-10-17 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/15428 typo corrected. Thank you all. @srowen @jkbradley

[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...

2016-10-11 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r82743072 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Si

[GitHub] spark pull request #15428: [SPARK-17219][ML] enchanced NaN value handling in...

2016-10-11 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/15428 [SPARK-17219][ML] enchanced NaN value handling in Bucketizer ## What changes were proposed in this pull request? This PR is an enhancement of PR with commit ID

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-13 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r78513514 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -109,7 +114,7 @@ final class QuantileDiscretizer @Since

[GitHub] spark issue #14858: [SPARK-17219][ML] Add NaN value handling in Bucketizer

2016-09-13 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14858 @srowen Updated. Thanks.

[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add groupKFold to CrossValidator

2016-09-13 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14640 @finleyb indeed, thank you for pointing it out. I have put it right and added a test to guard this issue. Many thanks. And feel free to let us know if you have any problem with this class or any

[GitHub] spark pull request #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spa...

2016-09-12 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/15055 [SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings ## What changes were proposed in this pull request? Several places in MLlib use custom regexes or other approaches

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-04 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r77465278 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r77283036 --- Diff: docs/ml-features.md --- @@ -1102,7 +1102,8 @@ for more details on the API. ## QuantileDiscretizer `QuantileDiscretizer` takes

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r77145656 --- Diff: docs/ml-features.md --- @@ -1102,7 +1102,8 @@ for more details on the API. ## QuantileDiscretizer `QuantileDiscretizer` takes

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r77138983 --- Diff: docs/ml-features.md --- @@ -1102,7 +1102,8 @@ for more details on the API. ## QuantileDiscretizer `QuantileDiscretizer` takes

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r77138037 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r77134887 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since

[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add labelKFold to CrossValidator

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14640 Updates: 1. code refactoring. Rename the API to align with Sklearn changes 2. add implementation in CrossValidator

[GitHub] spark issue #14858: [SPARK-17219][ML] Add NaN value handling in Bucketizer

2016-08-31 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14858 updated tests and documents related to this change

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-30 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76773626 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -114,10 +114,10 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-30 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76738479 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -106,18 +106,19 @@ final class Bucketizer @Since("1.4.0"

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-30 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76738333 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -114,10 +114,10 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76572410 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76571166 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Si

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76570646 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76569942 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Si

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request: https://github.com/apache/spark/pull/14858#discussion_r76569900 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -129,17 +129,21 @@ object Bucketizer extends DefaultParamsReadable

[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-08-29 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/14858 [SPARK-17219][ML] Add NaN value handling in Bucketizer ## What changes were proposed in this pull request? This PR fixes an issue when a cutpoints vector containing NaN is sent
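
A generic illustration of why NaN needs explicit handling in any bucket lookup: every ordering comparison against NaN is false, so a binary search neither fails nor finds a sensible position. The sketch below uses Python's standard `bisect` module and a hypothetical `bucket_index` helper. It is not the Bucketizer code, and the choice to raise on NaN is only one possible policy, assumed here for illustration.

```python
import bisect
import math

splits = [0.0, 1.0, 2.0]  # two buckets: [0, 1) and [1, 2]

# NaN compares false against every split, so the search walks to the end
# and silently reports a position past the last bucket.
nan_position = bisect.bisect_right(splits, float("nan"))


def bucket_index(value, splits):
    """Hypothetical lookup that refuses NaN instead of misplacing it."""
    if math.isnan(value):
        raise ValueError("NaN cannot be assigned to a bucket")
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    if value == splits[-1]:
        return len(splits) - 2  # the last bucket includes its upper bound
    return bisect.bisect_right(splits, value) - 1


print(nan_position, bucket_index(0.5, splits), bucket_index(2.0, splits))
```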

[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add labelKFold to CrossValidator

2016-08-22 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14640 @holdenk thanks for your comments. :) You are right. But as you can see, this is a variant of kFold, so I think it's better to stay close to it, otherwise, it would seem confusing, don't you

[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add labelKFold to CrossValidator

2016-08-22 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14640 if one understands the underlying ideas behind this method (labelKFold), it's easy to take it as a class/category of data, though I do think it's not that straightforward, even a bit confusing

[GitHub] spark issue #14747: [SPARK-17086][ML] Fix an issue in QuantileDiscretizer

2016-08-22 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14747 it seems Array.distinct will not break the sequence of the elements. But, you are right, we need to guarantee the array is sorted.
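
The point in this comment can be shown with a small Python analogue. An order-preserving de-duplication keeps the first occurrence of each element in its original position, so a sorted input stays sorted, but an unsorted input stays unsorted. `dict.fromkeys` stands in for Scala's `Array.distinct` here, and the split values are made up for illustration.

```python
def distinct_in_order(values):
    """Drop duplicates, keeping first occurrences in their original order."""
    return list(dict.fromkeys(values))


# Sorted candidate splits with duplicates: de-duplication alone is enough.
candidate_splits = [float("-inf"), 1.0, 1.0, 2.0, 2.0, float("inf")]
splits = distinct_in_order(candidate_splits)

# Unsorted input: order is preserved, so the result is still unsorted
# and an explicit sort is needed to guarantee increasing splits.
still_unsorted = distinct_in_order([2.0, 1.0, 2.0])
guaranteed_sorted = sorted(set([2.0, 1.0, 2.0]))

print(splits, still_unsorted, guaranteed_sorted)
```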

[GitHub] spark issue #14747: [SPARK-17086][ML] Fix an issue in QuantileDiscretizer

2016-08-22 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/14747 yes, the output from approxQuantile is a sorted array.

[GitHub] spark pull request #14747: [SPARK-17086] Fix an issue in QuantileDiscretizer

2016-08-22 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/14747 [SPARK-17086] Fix an issue in QuantileDiscretizer ## What changes were proposed in this pull request? In cases when QuantileDiscretizerSuite is called upon a numeric array

[GitHub] spark pull request #14640: [SPARK-17055] add labelKFold to CrossValidator

2016-08-14 Thread VinceShieh
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/14640 [SPARK-17055] add labelKFold to CrossValidator ## What changes were proposed in this pull request? This patch improves the CrossValidator by adding a new training/validation split
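
For context on the labelKFold / groupKFold naming used across this thread: in scikit-learn, a group k-fold split keeps all samples that share a group in the same fold, so a group never appears on both the training and the validation side. The following plain-Python sketch shows that property only. The `group_k_fold` helper and its round-robin assignment of groups to folds are assumptions for illustration, not the CrossValidator change itself.

```python
def group_k_fold(groups, k):
    """Return k (train_indices, validation_indices) pairs.

    groups: one group identifier per sample.
    Groups are assigned to folds round-robin in sorted order; that
    assignment rule is an assumption made for this sketch.
    """
    unique = sorted(set(groups))
    if k > len(unique):
        raise ValueError("k cannot exceed the number of distinct groups")
    fold_of = {group: i % k for i, group in enumerate(unique)}
    folds = []
    for fold in range(k):
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        folds.append((train, valid))
    return folds


groups = ["u1", "u1", "u2", "u3", "u3", "u2"]
folds = group_k_fold(groups, 3)
for train, valid in folds:
    print("train", train, "validation", valid)
```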