[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

2017-11-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19588 @hhbyyh comments addressed. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark pull request #19350: [SPARK-22126][ML] Fix model-specific optimization...

2017-11-03 Thread WeichenXu123
GitHub user WeichenXu123 reopened a pull request: https://github.com/apache/spark/pull/19350 [SPARK-22126][ML] Fix model-specific optimization support for ML tuning ## What changes were proposed in this pull request? Push down fitting parallelization code from

[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...

2017-11-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19641 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19586 We can config the class to register by config `spark.kryo.classesToRegister`, does it need to add into spark code

[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-11-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19586 and in `ml`, if we want to register class before running algos, Some other classes like `LabeledPoint`, `Instance` also need registered. and there're some class temporary defined in

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148926895 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0"

[GitHub] spark issue #19661: [SPARK-22450][Core][Mllib]safely register class for mlli...

2017-11-05 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19661 So why do you include the class such as `org.apache.spark.ml.feature.Instance`. You can look into a lot of algos, in `ml` package (not `mllib`), still use something like `RDD[Instance

[GitHub] spark issue #19661: [SPARK-22450][Core][Mllib]safely register class for mlli...

2017-11-05 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19661 And I don't know whether these class dependency injection into spark-core lib is reasonable ... --- - To unsubscri

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17000 @MLnick It looks like VF-LBFGS has a different scenario. In VF algos, the vectors will be too large to store in driver memory, so we slice the vectors into different machines (stored by `RDD

[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19666 [SPARK-22451][ML] Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories) ## What changes were proposed in this pull request? We do not

[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/16864 @jiangxb1987 yes I agree to close it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r149269770 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -101,6 +101,20 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149274660 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -976,6 +930,44 @@ private[spark] object RandomForest extends

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 @smurching I guess if iterating over gray code will have higher time complexity O(n * 2^n), (Not very sure, maybe there's some high efficient algos?) , the recursive traverse in my PR

[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19020 LGTM. thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel indexe...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19662 Looks reasonable, have you check other places which have similar issue ? --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark pull request #19685: [SPARK-19759][ML] not using blas in ALSModel.pred...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19685#discussion_r149554146 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -289,9 +289,11 @@ class ALSModel private[ml] ( private

[GitHub] spark issue #19685: [SPARK-19759][ML] not using blas in ALSModel.predict for...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19685 Have you made some test to check the performance difference for this ? --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 @facaiy Thanks for your review! I put more explanation on the design purpose of `traverseUnorderedSplits`. But, if you have better solution, no hesitate to tell me

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 Also cc @smurching Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149561550 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -741,17 +678,43 @@ private[spark] object RandomForest extends

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 ok I agree this change. @jkbradley Can you take a look ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19666#discussion_r149567340 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -631,6 +614,42 @@ class RandomForestSuite extends

[GitHub] spark pull request #19662: [SPARK-22446][SQL][ML] Declare StringIndexerModel...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19662#discussion_r149567769 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala --- @@ -126,4 +126,25 @@ class VectorAssemblerSuite

[GitHub] spark pull request #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interfa...

2017-11-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r149854985 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -94,46 +97,86 @@ object Summarizer extends Logging { * - min

[GitHub] spark pull request #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interfa...

2017-11-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r149855295 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -197,14 +240,14 @@ private[ml] object SummaryBuilderImpl extends

[GitHub] spark pull request #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interfa...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r149893125 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -94,46 +97,86 @@ object Summarizer extends Logging { * - min

[GitHub] spark pull request #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interfa...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r149941345 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -94,46 +98,87 @@ object Summarizer extends Logging { * - min

[GitHub] spark pull request #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interfa...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19156#discussion_r149956415 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- @@ -527,27 +570,28 @@ private[ml] object SummaryBuilderImpl extends

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 @facaiy Your idea looks also reasonable. So we can use the condition "exclude the first bin" to do the pruning (filter out the other half symmetric splits). This condition looks si

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15770 LGTM. ping @yanboliang --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18624#discussion_r150170451 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -286,40 +288,119 @@ object

[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18624 But, I agree the issue @MLnick mentioned, the code now looks convoluted, can you try to simplify it ? --- - To

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r150445283 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField

[GitHub] spark pull request #17972: [SPARK-20723][ML]Add intermediate storage level t...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17972#discussion_r150470524 --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala --- @@ -129,7 +129,7 @@ private[recommendation] trait ALSModelParams

[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r150486465 --- Diff: mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r150482756 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -476,6 +476,10 @@ class DenseMatrix @Since("

[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 OK. I will waiting @smurching to merge split parts of #19433 get merged first, and then I will update this PR. --- - To

[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r150731403 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -177,7 +202,9 @@ class TrainValidationSplit @Since("

[GitHub] spark issue #17972: [SPARK-20723][ML]Add intermediate storage level to tree ...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17972 Have you checked other algorithms which can also apply this parameter ? --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 I manually tested backwards compatibility and it works fine. I paste the test code for `CrossValidator` here. Run following code in spark-2.2 shell first: ``` import

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r150760733 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r150761099 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r150771136 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -311,22 +342,39 @@ class VectorIndexerModel private[ml

[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19588 Sure, I will add python api after this is merged. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 @viirya @MLnick Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19588 Python API jira created here: https://issues.apache.org/jira/browse/SPARK-22521 --- - To unsubscribe, e-mail: reviews

[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19753 [SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API ## What changes were proposed in this pull request? Add python api for

[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19588#discussion_r151055221 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala --- @@ -311,22 +346,39 @@ class VectorIndexerModel private[ml

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19621 I want to ask, for option `StringIndexer.frequencyDesc`, in the case existing two labels which have the same frequency, which of them will be put in the front ? If this is not specified

[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19753 @smurching The getter/setter is included in the super class `HasHandleInvalid`. I can add test for it. --- - To

[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151311569 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja

[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151311832 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja

[GitHub] spark issue #19627: [SPARK-21088][ML][WIP] CrossValidator, TrainValidationSp...

2017-11-16 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19627 My local test passed. This test failure looks like test system issue. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #19152: [SPARK-21915][ML][PySpark] Model 1 and Model 2 ParamMaps...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19152 @marktab You should close merged PR. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 ok to test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya I guess the reason is, the old PR version: `df.withColumn(..).withColumn(..).withColumn(..)`, the long df chain prevent the shuffle re-using... but now you merge them into one step

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 Great! That's it. thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional comman

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya Yes. But if there is some better design I will be happy to listen. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya It is possible I think. A similar example is, `HasRegParam` trait, do not put `setRegParam` in trait but moved into concrete estimator/transformer class, should be the same reason

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 Looks not the reason. maybe issues somewhere else. Let me run test later. Thanks! But there is some small issues in test: Don't include gen data time: ``` val

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya I run the code, you're right, most of time cost on the executedPlan generation (The old version code). thanks! But can you append benchmark comparison with `RDD.aggregate` ve

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r139470472 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r139467949 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya No, keep the dataframe version code. But I only want to confirm how much performance gap between this and RDD version. (for possible improvements in the future, because in similar test

[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/16774 OK. I will separate a PR. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya Scala `with trait` is a complex mechanism and `trait` isn't equivalent to java's `interface`. Scala compiler will precompile and generate many other codes, so java-side c

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 @smurching Thanks! I will update later. And note that I will separate part of this PR to a new PR (the separated part will be a bugfix for #16774

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @viirya Oh, I am not saying the compatibility against old version scala application. What I say is about new version `Bucketizer`, when spark user use java language(not scala language), call

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 Yes you can only move `setInputCols` into the outer class to resolve this issue. But I prefer merge it together. I think we can unify the `transform` method. (First we check param `inputCol

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 Oh. That's what have done in the old PR #18902 .(Because the RDD version (not in master branch, only personal impl here (sorry for put wrong link, the code link is here: https://githu

[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19186 @zhengruifeng You should write a design doc succinctly and link to jira first. After we reach agreement with the design this PR can move forward. Thanks

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19278 [SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug ## What changes were proposed in this pull request? Currently the param of CrossValidator

[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278 cc @BryanCutler Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #12066: [SPARK-7424] [ML] ML ClassificationModel should add meta...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/12066 @yanboliang Are you still working on this ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19106 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139854853 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -283,6 +282,8 @@ object CrossValidatorModel extends MLReadable

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139854872 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -122,7 +123,7 @@ class TrainValidationSplit @Since("

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139854984 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -212,14 +213,12 @@ object CrossValidator extends MLReadable

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r139855087 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -303,16 +304,16 @@ object CrossValidatorModel extends

[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278 @BryanCutler The reason I add `skipParams` is that, if we don't use `DefaultParamReader.getAndSetParams`, we have to hardcoding all params which are very troublesome. And every time we ad

[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 @smurching I will update this PR after #19278 merged. Because now this PR depend on that one. Thanks! --- - To unsubscribe

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17819 @MLnick Yea, you're right, only move `setXXX` to concrete class also work fine. The root cause is the `setXXX` return type. But I think the multi / single logic can be merged, because s

[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19229 @viirya Thanks very much! Although the perf gap exists (when numCols is large), it won't block this PR. I will create a JIRA to track

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140111453 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19288: [SPARK-22075][ML] GBTs unpersist datasets cached by Chec...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19288 OK I agree. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19106 ping @srowen Any other comments ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #14325: [SPARK-16692] [ML] Add multi label classification evalua...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14325 You should override `override def evaluate(dataset: Dataset[_])` (without the label param). --- - To unsubscribe, e-mail

[GitHub] spark issue #14325: [SPARK-16692] [ML] Add multi label classification evalua...

2017-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14325 ping @gatorsmile Add this to test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140188363 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140249325 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r140398857 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala --- @@ -160,11 +160,13 @@ class TrainValidationSplitSuite

[GitHub] spark pull request #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19278#discussion_r140398933 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -159,12 +159,15 @@ class CrossValidatorSuite

[GitHub] spark issue #19279: [SPARK-22061] [ML]add pipeline model of SVM

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19279 cc @srowen Can you help close this ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13794 cc @srowen Can you help close this ? We won't need this feature for now. --- - To unsubscribe, e-mail: reviews-uns

[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r140402700 --- Diff: python/pyspark/ml/tests.py --- @@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self): loadedModel

[GitHub] spark issue #19278: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSpli...

2017-09-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19278 @jkbradley Sure I tested the backwards compatibility. Part of the reason I changed into `DefaultParamReader.getAndSetParams` is for backwards compatibility

[GitHub] spark issue #19156: [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of d...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19156 @cloud-fan Can you help review the part of code which related to SQL interface ? --- - To unsubscribe, e-mail: reviews

<    3   4   5   6   7   8   9   10   11   12   >