Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/23144
Using an optional `normalize` function argument may be OK; I will give it a try.
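To make the idea concrete, here is a minimal sketch of a string option that takes an optional `normalize` function. `StringOption`, `validate`, and the `solver` values are hypothetical names chosen for illustration; this is a standalone class, not Spark's `Param`, and not necessarily what the PR ends up doing.

```scala
import java.util.Locale

// Hypothetical stand-in for a string-valued param; NOT Spark's Param class.
final case class StringOption(
    name: String,
    allowed: Set[String],
    // Optional normalization; the default leaves values untouched (case-sensitive).
    normalize: String => String = (s: String) => s) {

  // Normalize the allowed values once so both sides are compared in the same form.
  private val normalizedAllowed: Set[String] = allowed.map(normalize)

  // Returns the normalized value, or throws IllegalArgumentException if unsupported.
  def validate(value: String): String = {
    val v = normalize(value)
    require(normalizedAllowed.contains(v), s"$name: unsupported value '$value'")
    v
  }
}

object StringOptionDemo {
  // Locale-independent lowercasing used as the normalizer.
  val lower: String => String = _.toLowerCase(Locale.ROOT)

  val caseInsensitive: StringOption = StringOption("solver", Set("auto", "normal"), lower)
  val caseSensitive: StringOption = StringOption("solver", Set("auto", "normal"))
}
```

With this shape, case-insensitivity becomes an explicit choice per option rather than something each algorithm handles on its own.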
---
-
To unsubscribe, e-mail: reviews-unsubscr
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/23144
@srowen To adopt an optional `normalize` function argument, we may need to
create a new class `StringParam` and add the argument into it. But this will be
a breaking change, since existing
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/23144
I am not sure about `$$` or `%%`; we can replace them with other names.
I want to resolve the confusion around case-insensitivity, and wonder whether a
new flag can do this.
If we want to
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/23122#discussion_r236537309
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -671,7 +671,7 @@ class ALS(@Since("1.4.0") override val u
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/23144
[SPARK-26172][ML][WIP] Unify String Params' case-insensitivity in ML
## What changes were proposed in this pull request?
1, methods `lowerCaseInArray` and `upperCaseInArray` are creat
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/22991#discussion_r236110139
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -219,14 +225,20 @@ final class OneVsRestModel private[ml
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22991
friendly ping @srowen @jkbradley @MLnick
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r235886910
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/23123
[SPARK-26153][ML] GBT & RandomForest avoid unnecessary `first` job to
compute `numFeatures`
## What changes were proposed in this pull request?
use base models' `numFeature` i
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/23122
[MINOR][ML] add missing params to Instr
## What changes were proposed in this pull request?
add following param to instr:
GBTC: validationTol
GBTR: validationTol
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
@srowen Yes, this is the problem. I have to register `Param*` before any
prediction model, but there are too many anonymous classes in
`ParamValidators` and other places, and I have not found
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22087
I also exposed GMM's predictProbability.
Could you please make a final pass? @srowen @felixcheung
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
@srowen I have some spare time, and will work on it.
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/19927
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22975
retest this please
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/22991
[SPARK-25989][ML] OneVsRestModel handle empty outputCols incorrectly
## What changes were proposed in this pull request?
ignore empty output columns
## How was this patch tested
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22975
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
Not all public serializable classes need to be registered. Only those
that need ser-deser should be registered; one important group should be
transformers and prediction models
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
I am not sure, but maybe all serializable classes need to be registered.
Since `MultivariateGaussian` is a public class, I think we need to add
it.
I also wonder whether a test is
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
Do you mean the failure in this PR? It was caused by a non-registered field,
`BDM[Double]`.
`MultivariateGaussian` is used in GMM, so Kryo registration should help
performance.
As to mllib
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22974
@srowen The existing Kryo registration test suite needs to import spark-core:
```
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
val conf = new
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22975
@srowen Yes, we should keep user input data and column names. Thanks for
your explanation!
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/22975
[SPARK-20156][SQL][ML][FOLLOW-UP] Java String toLowerCase with Locale.ROOT
## What changes were proposed in this pull request?
Add `Locale.ROOT` to all internal calls to String
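As background for why this matters, the sketch below shows how `String.toLowerCase` depends on the locale. The object name and the sample string are illustrative assumptions, not code from the PR; the Turkish locale is used because its casing rules map `I` to a dotless `ı`.

```scala
import java.util.Locale

object LocaleRootSketch {
  // Turkish casing rules differ from English for the letter 'I'.
  val turkish: Locale = Locale.forLanguageTag("tr")

  // Locale-sensitive: 'I' lowercases to dotless 'ı' (U+0131) under Turkish rules.
  val localeSensitive: String = "TITLE".toLowerCase(turkish)

  // Locale-independent: Locale.ROOT gives the same result on every machine.
  val localeIndependent: String = "TITLE".toLowerCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(localeSensitive == "title")   // false
    println(localeIndependent == "title") // true
  }
}
```

Calling `toLowerCase()` with no argument uses the JVM default locale, so comparisons against fixed lowercase identifiers can fail on machines configured for such locales.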
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/22974
[SPARK-22450][Core][MLLib][FollowUp] Safely register MultivariateGaussian
## What changes were proposed in this pull request?
register following classes in Kryo
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22971
retest this please
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/22971
[SPARK-25970][ML] Add Instrumentation to PrefixSpan
## What changes were proposed in this pull request?
Add Instrumentation to PrefixSpan
## How was this patch tested
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22087
Sounds good to design a universal prediction model as a super-class.
BTW, I think we can also create a new class `ProbabilisticPredictionModel`
(as a subclass of `PredictionModel`), so
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19927
@srowen What do you think about this? The current OVR model's transform is too
slow. Thanks.
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22087
@imatiach-msft Updated according to your comments! Thanks for
reviewing!
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/21561#discussion_r210468639
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/21561#discussion_r210467653
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/21561#discussion_r210158840
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -299,7 +299,7 @@ class KMeans private (
val bcCenters
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/22087
@felixcheung Test suites have been added. Thanks for reviewing!
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/22087
[SPARK-25097][Support prediction on single instance in KMeans/BiKMeans/GMM]
Support prediction on single instance in KMeans/BiKMeans/GMM
## What changes were proposed in this pull request
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/21561#discussion_r209498032
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -151,13 +152,9 @@ class BisectingKMeans private
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/21561#discussion_r209496789
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -157,11 +157,15 @@ class NaiveBayes @Since("
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/19084
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21563
@mengxr I noticed that you opened a ticket for supporting integer type labels
in ClusteringEvaluator, would you like to shepherd this PR too
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/19186
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/20918
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/18589
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/18389
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21563
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20028
LGTM, except for the since annotations.
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21788
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21563
@mgaido91 Sorry for the force push; I did it to update my git username in this
PR.
I found that my current PRs are not linked to my account, and it is
troublesome to track them
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21788
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21788
@felixcheung I had to force push in order to change the git username. I
will look into what happened
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21792
@srowen I think we need to update the docs
1, Current doc in `StringIndexer` is somewhat misleading: "The indices are
in `[0, numLabels)`, ordered by label frequencies, so the most fre
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/21792
[SPARK-23231][ML][DOC] Add doc for string indexer ordering to user guide
(also to RFormula guide)
## What changes were proposed in this pull request?
add doc for string indexer ordering
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/21788
[SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explain
RandomForestClassifier.featureSubsetStrategy well
## What changes were proposed in this pull request?
update d
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/21562
@felixcheung Would you mind making a final pass? Thanks!
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/21563#discussion_r197600500
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
---
@@ -107,15 +106,18 @@ class ClusteringEvaluator @Since
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/21563#discussion_r195618344
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
---
@@ -107,15 +106,18 @@ class ClusteringEvaluator @Since
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/16171
It is out of date, and I will close it
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/16171
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/16763
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/16763
This pr is out of date. I will close it.
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19084
@srowen Could you please give a final review? Thanks
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19927
@mengxr @holdenk What do you think about this? Thanks.
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/21563
[SPARK-24557][ML] ClusteringEvaluator support array input
## What changes were proposed in this pull request?
ClusteringEvaluator support array input
## How was this patch tested
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/21562
[Trivial][ML] GMM unpersist RDD after training
## What changes were proposed in this pull request?
unpersist `instances` after training
## How was this patch tested?
existing
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/18154
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/18154
This PR is out of date. I will close it.
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/20164
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20164
This pr is out of date. So I will close it.
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/21561
[SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/NB
## What changes were proposed in this pull request?
logNumExamples in KMeans/BiKM/GMM/AFT/NB
## How was this patch
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19927
@MLnick @jkbradley What are your thoughts?
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/19381#discussion_r180997645
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -192,12 +192,12 @@ abstract class ClassificationModel
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20956
@srowen Could you please help review this? Thanks in advance
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/20956#discussion_r180064831
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/NodeIdCache.scala ---
@@ -166,9 +166,13 @@ private[spark] class NodeIdCache
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/20956#discussion_r180063562
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/NodeIdCache.scala ---
@@ -95,7 +95,7 @@ private[spark] class NodeIdCache
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/20956
[SPARK-23841][ML] NodeIdCache should unpersist the last cached
nodeIdsForInstances
## What changes were proposed in this pull request?
unpersist the last cached nodeIdsForInstances in
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/20918
[SPARK-23805][ML][WIP] Features alg support vector-size validation and
Inference
## What changes were proposed in this pull request?
support vector-size validation and Inference in
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/20539
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/20518#discussion_r167417459
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -745,4 +763,27 @@ private[spark] class CosineDistanceMeasure
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20539
ping @jkbradley
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/20539
[SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN
- for branch-2.2
## What changes were proposed in this pull request?
for branch-2.2
only drops the rows
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/20518#discussion_r166813909
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -745,4 +763,27 @@ private[spark] class CosineDistanceMeasure
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19340
@mgaido91 agree that it is better to normalize centers
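As a rough illustration of why normalized centers fit cosine distance, the sketch below averages unit-length points and re-normalizes the result, as is commonly done in spherical k-means. It is a plain-Scala assumption for illustration, not the code in the PR, and `CosineCenterSketch` and its helpers are hypothetical names.

```scala
object CosineCenterSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Scale a vector to unit length; a zero vector has no direction, so reject it.
  def normalize(a: Array[Double]): Array[Double] = {
    val n = norm(a)
    require(n > 0.0, "cannot normalize a zero vector")
    a.map(_ / n)
  }

  // Cosine similarity ignores vector length: cosine(a, c * a) is 1 for any c > 0.
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  // Center update: normalize every point, sum them, then normalize the sum.
  // Fails via `require` if the unit vectors cancel out to the zero vector.
  def updateCenter(points: Seq[Array[Double]]): Array[Double] = {
    require(points.nonEmpty, "need at least one point")
    val unit = points.map(normalize)
    val sum = unit.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    normalize(sum)
  }
}
```

Because each point is scaled to unit length first, a point with a very large norm does not pull the center toward itself, which matches the fact that cosine similarity only compares directions.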
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20164
@WeichenXu123 Yes, my concern is that it is confusing if the transform
failure is caused by a column conflict with an "invisible" column.
@srowen Agree that it is not perfect if we
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20164
@srowen Different from the base model (like LoR), OVR and OVRModel do not
have param `rawPredictionCol`.
So if the input dataframe contains a column which has the same name as base
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19340
The updating of centers should be viewed as the **M-step** in the EM algorithm,
in which some objective is optimized.
Since cosine similarity does not take vector-norm into account:
1
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19340
@mgaido91 @srowen I have the same concern as @Kevin-Ferret and @viirya
I can't find any normalization of vectors before training, and the update
of centers seems incorrect.
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19892
@MLnick Thanks for your review and suggestions. I have updated this PR
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/20275
[SPARK-23085][ML] API parity for mllib.linalg.Vectors.sparse
## What changes were proposed in this pull request?
`ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])` support
zero
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20164
retest this please
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20164
retest this please
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/20164
[SPARK-22971][ML] OneVsRestModel should use temporary RawPredictionCol
## What changes were proposed in this pull request?
use temporary RawPredictionCol in `OneVsRestModel#transform
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20113
@WeichenXu123 I used this command to list all implementations of model.save, and
the others look OK.
`find mllib/src/main/scala -name '*.scala' | xargs -i bash -c 'egrep -in
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19892
ping @MLnick ?
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/20113
[SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel save
## What changes were proposed in this pull request?
make sure model data is stored in order. @WeichenXu123
Github user zhengruifeng closed the pull request at:
https://github.com/apache/spark/pull/20030
GitHub user zhengruifeng opened a pull request:
https://github.com/apache/spark/pull/20030
[SPARK-10496][CORE] Efficient RDD cumulative sum
## What changes were proposed in this pull request?
impl Efficient RDD cumulative sum
## How was this patch tested?
existing
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19950
@WeichenXu123 I am not very sure, but it seems that `Kryo` will automatically
ser/deser the `Tuple2[A, B]` type if both `A` and `B` have been registered:
```
scala> imp
Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/20017
ping @srowen