[GitHub] spark pull request: [SPARK-14509][DOC] Add python CountVectorizerE...

2016-04-11 Thread zhengruifeng
Github user zhengruifeng commented on the pull request: https://github.com/apache/spark/pull/11917#issuecomment-208689383 cc @holdenk Could you please take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request #19889: [SPARK-22690][ML] Imputer inherit HasOutputCols

2017-12-04 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19889 [SPARK-22690][ML] Imputer inherit HasOutputCols ## What changes were proposed in this pull request? make `Imputer` inherit `HasOutputCols` ## How was this patch tested

[GitHub] spark issue #19889: [SPARK-22690][ML] Imputer inherit HasOutputCols

2017-12-04 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19889 No other algs output multi-column for now --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19889: [SPARK-22690][ML] Imputer inherit HasOutputCols

2017-12-05 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19889 retest this please

[GitHub] spark issue #19889: [SPARK-22690][ML] Imputer inherit HasOutputCols

2017-12-05 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19889 retest this please

[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...

2017-12-05 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19084#discussion_r154897263 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala --- @@ -117,11 +113,56 @@ class MinMaxScaler @Since("1.5.0"

[GitHub] spark pull request #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer suppo...

2017-12-05 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19892 [SPARK-20542][FollowUp][PySpark] Bucketizer support multi-column ## What changes were proposed in this pull request? Bucketizer supports multi-column on the Python side ## How was

[GitHub] spark pull request #19894: [SPARK-22700][ML] Bucketizer.transform incorrectl...

2017-12-05 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19894 [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN ## What changes were proposed in this pull request? only drop the rows containing NaN in the input columns

[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...

2017-12-05 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 This PR is currently blocked by https://github.com/apache/spark/pull/19894#issuecomment-349315711

[GitHub] spark issue #19894: [SPARK-22700][ML] Bucketizer.transform incorrectly drops...

2017-12-05 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19894 ping @MLnick ?

[GitHub] spark pull request #19927: [WIP] OVR transform optimization

2017-12-07 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19927 [WIP] OVR transform optimization ## What changes were proposed in this pull request? optimize OVR transform ## How was this patch tested? existing tests You can merge this

[GitHub] spark issue #19927: [WIP] OVR transform optimization

2017-12-07 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19927 test code: ``` import org.apache.spark.ml.classification._ val df = spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/s

[GitHub] spark pull request #19927: [SPARK-22737][ML] OVR transform optimization

2017-12-12 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19927#discussion_r156314727 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -156,54 +153,22 @@ final class OneVsRestModel private[ml

[GitHub] spark pull request #19950: [SPARK-22450][Core][MLLib][FollowUp] safely regis...

2017-12-12 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19950 [SPARK-22450][Core][MLLib][FollowUp] safely register class for mllib - LabeledPoint/VectorWithNorm/TreePoint ## What changes were proposed in this pull request? register following classes

[GitHub] spark pull request #19963: [SPARK-20849][DOC][FOLLOWUP] Document R DecisionT...

2017-12-12 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19963 [SPARK-20849][DOC][FOLLOWUP] Document R DecisionTree - Link Classification Example ## What changes were proposed in this pull request? in https://github.com/apache/spark/pull/18067, only

[GitHub] spark issue #19950: [SPARK-22450][Core][MLLib][FollowUp] safely register cla...

2017-12-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19950 Since `VectorWithNorm` and `TreePoint` do not override method `equals`, we cannot directly use `===` to compare objects. `LabeledPoint` is a case class, whose method `equals` is
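
The point about `equals` can be illustrated with a minimal Scala sketch. `PlainPoint` and `CasePoint` are hypothetical stand-ins invented for this illustration, not the real `VectorWithNorm`, `TreePoint` or `LabeledPoint`; the sketch only assumes that a plain class without an `equals` override falls back to reference equality, while a case class gets a structural `equals` from the compiler.

```scala
// Hypothetical stand-ins for illustration only; not the Spark classes.
class PlainPoint(val label: Double, val norm: Double)   // no equals override
case class CasePoint(label: Double, norm: Double)       // compiler-generated equals

object EqualsSketch {
  // Field-by-field comparison, needed when equals is not overridden.
  def sameFields(a: PlainPoint, b: PlainPoint): Boolean =
    a.label == b.label && a.norm == b.norm

  def main(args: Array[String]): Unit = {
    val a = new PlainPoint(1.0, 2.0)
    val b = new PlainPoint(1.0, 2.0)
    println(a == b)            // reference equality: two distinct instances
    println(sameFields(a, b))  // compares the fields explicitly
    println(CasePoint(1.0, 2.0) == CasePoint(1.0, 2.0)) // structural equality
  }
}
```

A test that round-trips such objects would therefore compare fields one by one for the plain classes and could compare whole objects only for the case class.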

[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...

2017-12-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 retest this please

[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...

2017-12-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 retest this please

[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...

2017-12-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 ping @holdenk , can you help review this?

[GitHub] spark pull request #19530: [SPARK-22309][ML] Remove unused param in `LDAMode...

2017-10-18 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19530 [SPARK-22309][ML] Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy `nodeToFeaturesBc` in RandomForest ## What changes were proposed in this pull request? Re

[GitHub] spark pull request #19618: [SPARK-5484][Followup] PeriodicRDDCheckpointer do...

2017-10-30 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19618 [SPARK-5484][Followup] PeriodicRDDCheckpointer doc cleanup ## What changes were proposed in this pull request? PeriodicRDDCheckpointer was already moved out of mllib in Spark-5484

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r147950431 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,462 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r147950931 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,462 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r147953998 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,462 @@ +/* + * Licensed to the Apache

[GitHub] spark issue #19618: [SPARK-5484][Followup] PeriodicRDDCheckpointer doc clean...

2017-10-31 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19618 retest this please

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-11-02 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r148702816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,456 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #19288: [SPARK-22075][ML] unpersist datasets cached by Pe...

2017-09-19 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19288 [SPARK-22075][ML] unpersist datasets cached by PeriodicRDDCheckpointer ## What changes were proposed in this pull request? PeriodicRDDCheckpointer will automatically persist the last 3

[GitHub] spark issue #19288: [SPARK-22075][ML] GBTs unpersist datasets cached by Peri...

2017-09-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19288 retest this please

[GitHub] spark issue #19288: [SPARK-22075][ML] GBTs unpersist datasets cached by Peri...

2017-09-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19288 @srowen In MLlib, `PeriodicRDDCheckpointer` is only used in `GradientBoostedTrees`. I just found that there is another checkpointer, `PeriodicGraphCheckpointer`; I will check it

[GitHub] spark issue #19288: [SPARK-22075][ML] GBTs unpersist datasets cached by Peri...

2017-09-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19288 @srowen I checked `LDA`: although `unpersistDataSet` is not called in it, no intermediate cached RDDs are left after `fit()`. Then I checked `Pregel`, and found that each call of

[GitHub] spark issue #19288: [SPARK-22075][ML][GRAPHX] GBTs/Pregel unpersist datasets...

2017-09-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19288 @srowen I found that the cached RDDs in `Pregel` are just the result graph, and the intermediate RDDs are already unpersisted directly outside of the graphCheckpointer. So I think we don't ne

[GitHub] spark issue #19288: [SPARK-22075][ML] GBTs unpersist datasets cached by Chec...

2017-09-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19288 retest this please

[GitHub] spark issue #19288: [SPARK-22075][ML] GBTs unpersist datasets cached by Chec...

2017-09-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19288 @WeichenXu123 It may be better to destroy intermediate objects ASAP

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-22 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19229 I am not familiar with the SQL source, but I think it's great to transform all columns at once

[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-26 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19186 retest this please

[GitHub] spark pull request #19490: [Trivial][DOC] update code style for InteractionE...

2017-10-12 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19490 [Trivial][DOC] update code style for InteractionExample ## What changes were proposed in this pull request? code style update; no other similar issues found ## How was this patch

[GitHub] spark pull request #19490: [Trivial][DOC] update code style for InteractionE...

2017-10-16 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/19490

[GitHub] spark pull request #14643: [SPARK-17057][ML] ProbabilisticClassifierModels' ...

2016-08-14 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/14643 [SPARK-17057][ML] ProbabilisticClassifierModels' prediction more reasonable with multi zero thresholds ## What changes were proposed in this pull request? Change the behavi

[GitHub] spark issue #14643: [SPARK-17057][ML] ProbabilisticClassifierModels' predict...

2016-08-18 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/14643 @srowen I thought of `thresholds` as designed in ML just as a kind of `weight`. This design is easy to understand. Are there other libraries (like sklearn) that support thresholds? We can

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @MLnick @yanboliang I updated the performance comparison. The DF-based impl is a little slower than the RDD-based one when the number of columns is small. When the number of columns is large (100

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-17 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @yanboliang The RDD-based impl is in the [former commit](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465)

[GitHub] spark pull request #17951: [SPARK-20711][ML] Fix incorrect min/max for NaN v...

2017-08-20 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/17951

[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...

2017-08-28 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/17014#discussion_r135692470 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -85,6 +86,10 @@ abstract class Predictor[ M <: PredictionMo

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-28 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @yanboliang Although disappointed by DF's performance, I also approve the choice of DF, just for less code.

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-29 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 @yanboliang I have updated this PR according to the comments. Thanks.

[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer inc...

2017-08-29 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19084 [SPARK-20711][ML]MultivariateOnlineSummarizer incorrect min/max for NaN value ## What changes were proposed in this pull request? the current impl of min/max ignores `NaN` for a
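
As an illustration of how a running min can end up ignoring `NaN`, here is a small plain-Scala sketch. It is not the `MultivariateOnlineSummarizer` code; the object and method names are made up for this example. It only relies on two facts: every ordering comparison against `NaN` is false, and `math.min` returns `NaN` when either argument is `NaN`.

```scala
object NaNMinSketch {
  // Comparison-based update: `v < m` is false whenever v is NaN,
  // so NaN values are silently skipped.
  def minIgnoringNaN(xs: Seq[Double]): Double =
    xs.foldLeft(Double.MaxValue)((m, v) => if (v < m) v else m)

  // math.min-based update: once a NaN is seen, the result stays NaN.
  def minPropagatingNaN(xs: Seq[Double]): Double =
    xs.foldLeft(Double.MaxValue)((m, v) => math.min(m, v))

  def main(args: Array[String]): Unit = {
    val xs = Seq(3.0, Double.NaN, 1.0)
    println(minIgnoringNaN(xs))     // 1.0
    println(minPropagatingNaN(xs))  // NaN
  }
}
```

Which of the two behaviors is wanted depends on the caller, which is the trade-off discussed later in this thread for `MinMaxScaler`.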

[GitHub] spark issue #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Summarizer...

2017-08-29 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19084 ping @WeichenXu123 @srowen

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-29 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 Agree that we should pass `handlePersistence` to mllib impl. Thanks for pointing it out!

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-29 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 The current impl of `mllib.KMeans` does not seem to support caching; it just (log warnings)[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib

[GitHub] spark issue #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Summarizer...

2017-08-29 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19084 `MinMaxScalerSuite` fails because `MinMaxScaler` needs the behavior of ignoring `NaN`. So I think there are 2 options: 1, `MultivariateOnlineSummarizer/Summarizer` support param

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-31 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 Sounds good. And since adding `handlePersistence` as a `ml.Param` may influence many algs (more than those in this PR), I think we may need more discussion @MLnick @yanboliang

[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...

2017-09-03 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/17014#discussion_r136737427 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -304,16 +304,14 @@ class KMeans @Since("1.5.0") (

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-03 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 @jkbradley I am curious why `ml.Kmeans` is special enough that it needs a separate PR

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-03 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @WeichenXu123 No, I only cache the DataFrame. And the RDD-Version is [here](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465). I use the same

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-04 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 Jenkins, retest this please

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-09-10 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @smurching OK, I will close this PR and resubmit it to the new ticket.

[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...

2017-09-10 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/17014

[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-10 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19186 [SPARK-21972][ML] Add param handlePersistence ## What changes were proposed in this pull request? Add param handlePersistence ## How was this patch tested? existing tests

[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19186 Jenkins, retest this please

[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138237760 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -483,24 +488,17 @@ class LogisticRegression @Since

[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19186#discussion_r138243247 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -444,13 +444,13 @@ class

[GitHub] spark issue #19107: [SPARK-21799][ML] Fix `KMeans` performance regression ca...

2017-09-11 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19107 I am OK to resubmit the original PR if needed.

[GitHub] spark pull request #19197: [SPARK-18608][ML] Fix double caching

2017-09-11 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19197 [SPARK-18608][ML] Fix double caching ## What changes were proposed in this pull request? `df.rdd.getStorageLevel` => `df.storageLevel` using cmd `find . -name '*.scala

[GitHub] spark pull request #19198: [MINOR][DOC] Add missing call of `update()` in ex...

2017-09-11 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19198 [MINOR][DOC] Add missing call of `update()` in examples of PeriodicGraphCheckpointer & PeriodicRDDCheckpointer ## What changes were proposed in this pull request? forgot to call `up

[GitHub] spark issue #19198: [MINOR][DOC] Add missing call of `update()` in examples ...

2017-09-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19198 retest this please

[GitHub] spark issue #19197: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19197 ping @jkbradley @WeichenXu123

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-09-12 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 Any more comments on this PR? It has been about one month since the last modification.

[GitHub] spark pull request #19110: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-09-12 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19110#discussion_r138517690 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -297,6 +298,16 @@ final class OneVsRest @Since("

[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19186 @WeichenXu123 Thanks a lot for pointing it out! I also forgot about this. @smurching Thanks for your solution; however, I think there may be another drawback in it: The alg usually use

[GitHub] spark issue #19220: [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpa...

2017-09-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19220 LGTM Thanks for this catch!

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-13 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19229 In the test code, should we use `model.transform(df).count` instead?

[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19186 retest this please

[GitHub] spark pull request #19232: [SPARK-22009][ML] Using treeAggregate improve som...

2017-09-14 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19232 [SPARK-22009][ML] Using treeAggregate improve some algs ## What changes were proposed in this pull request? I tested on a dataset of about 10M instances, and found that using

[GitHub] spark issue #19232: [SPARK-22009][ML] Using treeAggregate improve some algs

2017-09-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19232 ping @yanboliang

[GitHub] spark pull request #17384: [SPARK-20056][ML] IsotonicRegression support Nume...

2017-04-20 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/17384

[GitHub] spark issue #17384: [SPARK-20056][ML] IsotonicRegression support Numeric fea...

2017-04-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17384 @MLnick Agree. I will close this PR.

[GitHub] spark issue #18589: [SPARK-16872][ML] Add Gaussian NB

2017-07-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18589 @MLnick Sorry for the late reply. It has been a long time since I got the last comments in the previous PR https://github.com/apache/spark/pull/15324, so I thought that the community may dislike that design

[GitHub] spark issue #18612: [SPARK-21388][ML][PySpark] GBTs inherit from HasStepSize...

2017-07-20 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18612 @holdenk I think the meaning of `StepSize` in GBT and `Threshold` in LinearSVC/Binarizer is almost the same as in other algs, so it may be better to make them inherit from the same trait

[GitHub] spark issue #18154: [SPARK-20932][ML]CountVectorizer support handle persiste...

2017-07-23 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18154 @hhbyyh @HyukjinKwon Sorry to reply late. I think it may be better to use a special logic if it is more efficient in performance. What is your opinion? @yanboliang @HyukjinKwon

[GitHub] spark issue #18610: [SPARK-21386] ML LinearRegression supports warm start fr...

2017-07-25 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18610 LGTM, this is really a great feature

[GitHub] spark issue #18610: [SPARK-21386] ML LinearRegression supports warm start fr...

2017-07-25 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18610 @yanboliang Agree. I think if there are some params which control the "shape" of model coefficients, then they should be overridden if we use an initial model. Like `k` in KMeans, GMM, L

[GitHub] spark pull request #17995: [SPARK-20762][ML]Make String Params Case-Insensit...

2017-08-09 Thread zhengruifeng
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/17995

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

2017-08-09 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/18902 [SPARK-21690][ML] one-pass imputer ## What changes were proposed in this pull request? parallelize the computation of all columns ## How was this patch tested? existing tests

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 Jenkins, retest this please

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh Yes, I will test the performance. Btw, the median computation, by calling `stat.approxQuantile`, will also transform the df into an rdd before aggregation. See https://github.com/apache/spark

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-10 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 I tested the performance on a small dataset; the values in the following table are the average durations in seconds: |numColumns| Old Mean | Old Median | New Mean | New Median

[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

2017-08-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 ping @MLnick ?

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-14 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh Good idea! We can also use this trick to compute the median, because the method `multipleApproxQuantiles`[https://github.com/apache/spark/blob/0e80ecae300f3e2033419b2d98da8bf092c105bb/sql/core

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 Jenkins, retest this please

[GitHub] spark issue #16763: [SPARK-19422][ML][WIP] Cache input data in algorithms

2017-08-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/16763 Jenkins, retest this please

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

2017-08-15 Thread zhengruifeng
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh I rewrote the implementation; now all `NaN` and `missingValue` entries are first transformed to `null`, and then the current methods are used. For columns only containing `null`, `null` is
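The normalize-then-aggregate idea described in this comment can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the pull request: `to_null` and `surrogate` are hypothetical names, `None` stands in for SQL `null`, and the median here is exact (`statistics.median`) whereas the thread mentions `stat.approxQuantile`. The excerpt is cut off before it says what happens to a column that contains only `null`; returning `None` in that case is purely an assumption of this sketch.

```python
import math
import statistics
from typing import List, Optional


def to_null(value: Optional[float], missing_value: float) -> Optional[float]:
    """Map None, NaN and the configured missing value to None (hypothetical helper)."""
    if value is None or math.isnan(value) or value == missing_value:
        return None
    return value


def surrogate(
    values: List[Optional[float]],
    strategy: str = "mean",
    missing_value: float = float("nan"),
) -> Optional[float]:
    """Normalize first, then apply an ordinary aggregate to the non-null values."""
    cleaned = [to_null(v, missing_value) for v in values]
    present = [v for v in cleaned if v is not None]
    if not present:
        # Assumption: the source excerpt is truncated at this case.
        return None
    if strategy == "mean":
        return statistics.mean(present)
    if strategy == "median":
        return statistics.median(present)
    raise ValueError(f"unsupported strategy: {strategy}")
```

Normalizing up front means the aggregation step needs only one rule for skipping values (ignore `None`), rather than separate checks for `NaN` and for the user-supplied missing value.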

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133372183 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133371918 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133370353 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133372968 --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133370197 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133368511 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133370279 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133372318 --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-08-16 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133368243 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache
