spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

2018-09-06 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.4 b632e775c -> 085f731ad [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception ## What changes were proposed in this pull request? mapValues in scala is currently not serializable. To avoid the

spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

2018-09-06 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 7ef6d1daf -> 3b6591b0b [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception ## What changes were proposed in this pull request? mapValues in scala is currently not serializable. To avoid the

spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3

2018-08-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 42c1fdd22 -> f5983823e [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3 ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value.

spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values

2018-08-23 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 8ed044928 -> b5e118808 [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. Add return. ## How

spark git commit: [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds

2018-08-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 99d2e4e00 -> 72ecfd095 [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds ## What changes were proposed in this pull request? runParallelPersonalizedPageRank in graphx checks that `sources` are <=

spark git commit: [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented.

2018-07-20 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 244bcff19 -> 3cb1b5780 [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented. ## What changes were proposed in this pull request? Followup for #21719. Update spark.ml training code to fully wrap instrumented methods and

spark git commit: [SPARK-24747][ML] Make Instrumentation class more flexible

2018-07-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 7688ce88b -> 912634b00 [SPARK-24747][ML] Make Instrumentation class more flexible ## What changes were proposed in this pull request? This PR updates the Instrumentation class to make it more flexible and a little bit easier to use. When

spark git commit: [SPARK-7132][ML] Add fit with validation set to spark.ml GBT

2018-05-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master a33dcf4a0 -> ffaefe755 [SPARK-7132][ML] Add fit with validation set to spark.ml GBT ## What changes were proposed in this pull request? Add fit with validation set to spark.ml GBT ## How was this patch tested? Will add later. Author:

spark git commit: [SPARK-24114] Add instrumentation to FPGrowth.

2018-05-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master a7a9b1837 -> 439c69511 [SPARK-24114] Add instrumentation to FPGrowth. ## What changes were proposed in this pull request? Have FPGrowth keep track of model training using the Instrumentation class. ## How was this patch tested? manually

spark git commit: [SPARK-22210][ML] Add seed for LDA variationalTopicInference

2018-05-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 991726f31 -> bfd75cdfb [SPARK-22210][ML] Add seed for LDA variationalTopicInference ## What changes were proposed in this pull request? - Add seed parameter for variationalTopicInference - Add seed for calling variationalTopicInference

spark git commit: [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API

2018-05-15 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 6b94420f6 -> 8a13c5096 [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API ## What changes were proposed in this pull request? See SPARK-23455 for reference. Now default params in ML are saved

spark git commit: [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-05-09 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 628c7b517 -> 7aaa148f5 [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs ## What changes were proposed in this pull request? Provide evaluateEachIteration method or equivalent for spark.ml GBTs. ##

spark git commit: [MINOR][ML][DOC] Improved Naive Bayes user guide explanation

2018-05-09 Thread jkbradley
user guide page. I also improved the wording and organization slightly. ## How was this patch tested? Built docs locally. Author: Joseph K. Bradley <jos...@databricks.com> Closes #21272 from jkbradley/nb-doc-update. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http:

spark git commit: [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan

2018-05-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master f48bd6bdc -> 76ecd0950 [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan ## What changes were proposed in this pull request? PrefixSpan API for spark.ml. New implementation instead of #20810 ## How was this

spark git commit: [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning

2018-05-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 1c9c5de95 -> f48bd6bdc [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning ## What changes were proposed in this pull request? ML test for StructuredStreaming: spark.ml.tuning ## How was this patch tested? N/A

spark git commit: [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark

2018-05-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master d83e96372 -> 56a52e0a5 [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark ## What changes were proposed in this pull request? Change FPGrowth from private to private[spark]. If no

spark git commit: [SPARK-23990][ML] Instruments logging improvements - ML regression package

2018-04-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 83013752e -> 379bffa05 [SPARK-23990][ML] Instruments logging improvements - ML regression package ## What changes were proposed in this pull request? Instruments logging improvements - ML regression package I add an `OptionalInstrument`

spark git commit: [SPARK-23455][ML] Default Params in ML should be saved separately in metadata

2018-04-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master ce7ba2e98 -> 83013752e [SPARK-23455][ML] Default Params in ML should be saved separately in metadata ## What changes were proposed in this pull request? We save ML's user-supplied params and default params as one entity in metadata.

spark git commit: [SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features

2018-04-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 55c4ca88a -> 2a24c481d [SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features ## What changes were proposed in this pull request? - Multiple possible input types is added in validateAndTransformSchema() and

spark git commit: [SPARK-24026][ML] Add Power Iteration Clustering to spark.ml

2018-04-19 Thread jkbradley
K. Bradley <jos...@databricks.com> Closes #21090 from jkbradley/wangmiao1981-pic. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a471880a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a471880a Diff: http://git-wip-us.apach

spark git commit: [SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer

2018-04-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master f39e82ce1 -> 1ca3c50fe [SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer ## What changes were proposed in this pull request? Python API for DataFrame-based multivariate summarizer. ## How was this patch

spark git commit: [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API

2018-04-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 5003736ad -> 04614820e [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API ## What changes were proposed in this pull request? Add python API for collecting sub-models during

spark git commit: [SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for OneVsRestModel

2018-04-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 083cf2235 -> 5003736ad [SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for OneVsRestModel add RawPrediction as output column add numClasses and numFeatures to OneVsRestModel ## What changes were proposed in this pull

spark git commit: [SPARK-23751][FOLLOW-UP] fix build for scala-2.12

2018-04-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 0b19122d4 -> 0f93b91a7 [SPARK-23751][FOLLOW-UP] fix build for scala-2.12 ## What changes were proposed in this pull request? fix build for scala-2.12 ## How was this patch tested? Manual. Author: WeichenXu

spark git commit: typo rawPredicition changed to rawPrediction

2018-04-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 75a183071 -> 9d960de08 typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch

spark git commit: typo rawPredicition changed to rawPrediction

2018-04-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 acfc156df -> 03a4dfd69 typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this

spark git commit: [SPARK-22883][ML] ML test for StructuredStreaming: spark.ml.feature, I-M

2018-04-11 Thread jkbradley
1042 from jkbradley/SPARK-22883-part2-2.3backport. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/acfc156d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/acfc156d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff

spark git commit: [SPARK-22883] ML test for StructuredStreaming: spark.ml.feature, I-M

2018-04-11 Thread jkbradley
ter * Interaction * MaxAbsScaler * MinHashLSH * MinMaxScaler * NGram ## How was this patch tested? It is a bunch of tests! Author: Joseph K. Bradley <jos...@databricks.com> Closes #20964 from jkbradley/SPARK-22883-part2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http:

spark git commit: [SPARK-23944][ML] Add the set method for the two LSHModel

2018-04-10 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4f1e8b9bb -> 7c7570d46 [SPARK-23944][ML] Add the set method for the two LSHModel ## What changes were proposed in this pull request? Add two set method for LSHModel in LSH.scala, BucketedRandomProjectionLSH.scala, and MinHashLSH.scala

spark git commit: [SPARK-23871][ML][PYTHON] add python api for VectorAssembler handleInvalid

2018-04-10 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master adb222b95 -> 4f1e8b9bb [SPARK-23871][ML][PYTHON] add python api for VectorAssembler handleInvalid ## What changes were proposed in this pull request? add python api for VectorAssembler handleInvalid ## How was this patch tested? Add

spark git commit: [SPARK-23751][ML][PYSPARK] Kolmogorov-Smirnoff test Python API in pyspark.ml

2018-04-10 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master e17965891 -> adb222b95 [SPARK-23751][ML][PYSPARK] Kolmogorov-Smirnoff test Python API in pyspark.ml ## What changes were proposed in this pull request? Kolmogorov-Smirnoff test Python API in `pyspark.ml` **Note** API with `CDF` is a

spark git commit: [SPARK-14681][ML] Provide label/impurity stats for spark.ml decision tree nodes

2018-04-09 Thread jkbradley
ide val rootNode: ClassificationNode class DecisionTreeRegressionModel override val rootNode: RegressionNode ``` Closes #17466 ## How was this patch tested? UT will be added soon. Author: WeichenXu <weichen...@databricks.com> Author: jkbradley <joseph.kurata.brad...@gmail.com> Clo

spark git commit: [SPARK-23859][ML] Initial PR for Instrumentation improvements: UUID and logging levels

2018-04-06 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master c926acf71 -> d23a805f9 [SPARK-23859][ML] Initial PR for Instrumentation improvements: UUID and logging levels ## What changes were proposed in this pull request? Initial PR for Instrumentation improvements: UUID and logging levels. This

spark git commit: [SPARK-23870][ML] Forward RFormula handleInvalid Param to VectorAssembler to handle invalid values in non-string columns

2018-04-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4807d381b -> f2ac08795 [SPARK-23870][ML] Forward RFormula handleInvalid Param to VectorAssembler to handle invalid values in non-string columns ## What changes were proposed in this pull request? `handleInvalid` Param was forwarded to

spark git commit: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-04-02 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 28ea4e314 -> a1351828d [SPARK-23690][ML] Add handleinvalid to VectorAssembler ## What changes were proposed in this pull request? Introduce `handleInvalid` parameter in `VectorAssembler` that can take in `"keep", "skip", "error"`

spark git commit: [MINOR] Fix Java lint from new JavaKolmogorovSmirnovTestSuite

2018-03-21 Thread jkbradley
ion of JavaKolmogorovSmirnovTestSuite Author: Joseph K. Bradley <jos...@databricks.com> Closes #20875 from jkbradley/kstest-lint-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a091ee67 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree

spark git commit: [SPARK-10884][ML] Support prediction on single instance for regression and classification related models

2018-03-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 500b21c3d -> bf09f2f71 [SPARK-10884][ML] Support prediction on single instance for regression and classification related models ## What changes were proposed in this pull request? Support prediction on single instance for regression and

spark git commit: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTest in MLlib

2018-03-20 Thread jkbradley
ace for `KolmogorovSmirnovTest` in `mllib.stat`. ## How was this patch tested? Test suite added. Author: WeichenXu <weichen...@databricks.com> Author: jkbradley <joseph.kurata.brad...@gmail.com> Closes #19108 from WeichenXu123/ml-ks-test. Project: http://git-wip-us.apache.org/repos/asf/spark/re

spark git commit: [SPARK-23728][BRANCH-2.3] Fix ML tests with expected exceptions running streaming tests

2018-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 80e79430f -> 920493949 [SPARK-23728][BRANCH-2.3] Fix ML tests with expected exceptions running streaming tests ## What changes were proposed in this pull request? The testTransformerByInterceptingException failed to catch the

[2/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
[SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z # What changes were proposed in this pull request? Adds structured streaming tests using testTransformer for these suites: - NGramSuite - NormalizerSuite - OneHotEncoderEstimatorSuite - OneHotEncoderSuite - PCASuite -

[1/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 1098933b0 -> 279b3db89 http://git-wip-us.apache.org/repos/asf/spark/blob/279b3db8/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff --git

[1/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 f3efbfa4b -> 0663b6119 http://git-wip-us.apache.org/repos/asf/spark/blob/0663b611/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff

[2/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
[SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z # What changes were proposed in this pull request? Adds structured streaming tests using testTransformer for these suites: - NGramSuite - NormalizerSuite - OneHotEncoderEstimatorSuite - OneHotEncoderSuite - PCASuite -

spark git commit: [SPARK-18630][PYTHON][ML] Move del method from JavaParams to JavaWrapper; add tests

2018-03-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 508573958 -> 7706eea6a [SPARK-18630][PYTHON][ML] Move del method from JavaParams to JavaWrapper; add tests The `__del__` method that explicitly detaches the object was moved from `JavaParams` to `JavaWrapper` class, this way model

spark git commit: [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification

2018-03-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4586eada4 -> 98a5c0a35 [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in

spark git commit: [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification

2018-03-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 232b9f81f -> 4550673b1 [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in

spark git commit: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.feature, from A to H

2018-03-01 Thread jkbradley
<jos...@databricks.com> Closes #20111 from jkbradley/SPARK-22883-streaming-featureAM. (cherry picked from commit 119f6a0e4729aa952e811d2047790a32ee90bf69) Signed-off-by: Joseph K. Bradley <jos...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wi

spark git commit: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.feature, from A to H

2018-03-01 Thread jkbradley
<jos...@databricks.com> Closes #20111 from jkbradley/SPARK-22883-streaming-featureAM. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/119f6a0e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/119f6a0e Diff: http:

spark git commit: [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN - for branch-2.2

2018-02-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 a95c3e29d -> 1cc34f3e5 [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN - for branch-2.2 ## What changes were proposed in this pull request? for branch-2.2 only drops the rows containing NaN in the input

spark git commit: [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug

2018-02-15 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 03960faa6 -> 0bd7765cd [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug ## What changes were proposed in this pull request? Problem: Since 2.3, `Bucketizer` supports multiple input/output columns. We will

spark git commit: [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug

2018-02-15 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 6968c3cfd -> db45daab9 [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug ## What changes were proposed in this pull request? Problem: Since 2.3, `Bucketizer` supports multiple input/output columns. We will

spark git commit: [SPARK-23154][ML][DOC] Document backwards compatibility guarantees for ML persistence

2018-02-13 Thread jkbradley
ML models and Pipelines from old Spark versions. Discussed & confirmed on linked JIRA. Author: Joseph K. Bradley <jos...@databricks.com> Closes #20592 from jkbradley/SPARK-23154-backwards-compat-doc. (cherry picked from commit d58fe28836639e68e262812d911f167cb071007b) Signed-off

spark git commit: [SPARK-23154][ML][DOC] Document backwards compatibility guarantees for ML persistence

2018-02-13 Thread jkbradley
ML models and Pipelines from old Spark versions. Discussed & confirmed on linked JIRA. Author: Joseph K. Bradley <jos...@databricks.com> Closes #20592 from jkbradley/SPARK-23154-backwards-compat-doc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apa

spark git commit: [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator.

2018-01-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 863ffdc8a -> 833a584bb [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator. ## What changes were proposed in this pull request? RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid

spark git commit: [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator.

2018-01-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 12db365b4 -> 4371466b3 [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator. ## What changes were proposed in this pull request? RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid

spark git commit: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 6bb22961c -> 55695c712 [SPARK-23008][ML] OnehotEncoderEstimator python API ## What changes were proposed in this pull request? OnehotEncoderEstimator python API. ## How was this patch tested? doctest Author: WeichenXu

spark git commit: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 186bf8fb2 -> b5042d75c [SPARK-23008][ML] OnehotEncoderEstimator python API ## What changes were proposed in this pull request? OnehotEncoderEstimator python API. ## How was this patch tested? doctest Author: WeichenXu

spark git commit: [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 f891ee324 -> 2ec302658 [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline ## What changes were proposed in this pull request? Including VectorSizeHint in RFormula piplelines will allow them to be applied to

spark git commit: [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 6f7aaed80 -> 186bf8fb2 [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline ## What changes were proposed in this pull request? Including VectorSizeHint in RFormula piplelines will allow them to be applied to

spark git commit: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator

2018-01-05 Thread jkbradley
Closes #20132 from jkbradley/viirya-SPARK-13030. (cherry picked from commit 930b90a84871e2504b57ed50efa7b8bb52d3ba44) Signed-off-by: Joseph K. Bradley <jos...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commi

spark git commit: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator

2018-01-05 Thread jkbradley
Closes #20132 from jkbradley/viirya-SPARK-13030. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/930b90a8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/930b90a8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff

spark git commit: [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit

2018-01-04 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 145820bda -> 5b524cc0c [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit ## What changes were proposed in this pull request? Avoid holding all models in memory for

spark git commit: [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit

2018-01-04 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 52fc5c17d -> cf0aa6557 [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit ## What changes were proposed in this pull request? Avoid holding all models in memory for

spark git commit: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator

2017-12-31 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 5955a2d0f -> 994065d89 [SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator ## What changes were proposed in this pull request? This patch adds a new class `OneHotEncoderEstimator` which extends `Estimator`. The

spark git commit: [SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test

2017-12-29 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 816963043 -> 2ea17afb6 [SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test ## What changes were proposed in this pull request? ML regression package testsuite add StructuredStreaming test In order to

spark git commit: [SPARK-22922][ML][PYSPARK] Pyspark portion of the fit-multiple API

2017-12-29 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master ccda75b0d -> 30fcdc038 [SPARK-22922][ML][PYSPARK] Pyspark portion of the fit-multiple API ## What changes were proposed in this pull request? Adding fitMultiple API to `Estimator` with default implementation. Also update have ml.tuning

spark git commit: [SPARK-22905][ML][FOLLOWUP] Fix GaussianMixtureModel save

2017-12-29 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4e9e6aee4 -> afc364146 [SPARK-22905][ML][FOLLOWUP] Fix GaussianMixtureModel save ## What changes were proposed in this pull request? make sure model data is stored in order. WeichenXu123 ## How was this patch tested? existing tests

spark git commit: [SPARK-22905][MLLIB] Fix ChiSqSelectorModel save implementation

2017-12-28 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master ffe6fd77a -> c74573084 [SPARK-22905][MLLIB] Fix ChiSqSelectorModel save implementation ## What changes were proposed in this pull request? Currently, in `ChiSqSelectorModel`, save: ```

spark git commit: [SPARK-22899][ML][STREAMING] Fix OneVsRestModel transform on streaming data failed.

2017-12-27 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 774715d5c -> 753793bc8 [SPARK-22899][ML][STREAMING] Fix OneVsRestModel transform on streaming data failed. ## What changes were proposed in this pull request? Fix OneVsRestModel transform on streaming data failed. ## How was this patch

spark git commit: [SPARK-22707][ML] Optimize CrossValidator memory occupation by models in fitting

2017-12-24 Thread jkbradley
PR to fix it. ## Discussion I give 3 approaches which we can compare, after discussion I realized none of them is ideal, we have to make a trade-off. **After discussion with jkbradley , choose approach 3** ### Approach 1 ~~The approach proposed by MrBago at~~ https://github.com/apache/spark/p

spark git commit: [SPARK-22346][ML] VectorSizeHint Transformer for using VectorAssembler in StructuredSteaming

2017-12-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 13190a4f6 -> d23dc5b8e [SPARK-22346][ML] VectorSizeHint Transformer for using VectorAssembler in StructuredSteaming ## What changes were proposed in this pull request? A new VectorSizeHint transformer was added. This transformer is meant

spark git commit: [SPARK-22644][ML][TEST] Make ML testsuite support StructuredStreaming test

2017-12-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master c7d014861 -> 0e36ba621 [SPARK-22644][ML][TEST] Make ML testsuite support StructuredStreaming test ## What changes were proposed in this pull request? We need to add some helper code to make testing ML transformers & models easier with

spark git commit: [SPARK-21866][ML][PYSPARK] Adding spark image reader

2017-11-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 0605ad761 -> 1edb3175d [SPARK-21866][ML][PYSPARK] Adding spark image reader ## What changes were proposed in this pull request? Adding spark image reader, an implementation of schema for representing images in spark DataFrames The code

spark git commit: [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid

2017-11-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 774398045 -> 1e6f76059 [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid ## What changes were proposed in this pull request? Support skip/error/keep strategy, similar to `StringIndexer`. Implemented

spark git commit: [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala

2017-11-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b00972259 -> 774398045 [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala ## What changes were proposed in this pull request? We add a parameter whether to collect the full model list when

spark git commit: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark

2017-11-13 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master c8b7f97b8 -> d8741b2b0 [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark ## What changes were proposed in this pull request? Fix doc issue mentioned here:

spark git commit: [SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark

2017-10-27 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b3d8fc3dc -> 20eb95e5e [SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark ## What changes were proposed in this pull request? Add parallelism support for ML tuning in pyspark. ## How was this patch tested?

spark git commit: [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic)

2017-10-25 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 9ed64048a -> 35725f735 [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic) ## What changes were proposed in this pull request? Fix NaiveBayes unit test occasionly fail: Set seed

spark git commit: [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic)

2017-10-25 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b377ef133 -> 841f1d776 [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic) ## What changes were proposed in this pull request? Fix NaiveBayes unit test occasionly fail: Set seed for

spark git commit: [SPARK-14371][MLLIB] OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-10-18 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 1f25d8683 -> 52facb006 [SPARK-14371][MLLIB] OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver Hi, # What changes were proposed in this pull request? as it was proposed by jkbradley , ```gam

spark git commit: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug

2017-09-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 3e6a714c9 -> f180b6534 [SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug ## What changes were proposed in this pull request? Currently the param of CrossValidator/TrainValidationSplit persist/loading is

spark git commit: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 63098dc31 -> b606dc177 [SPARK-18608][ML] Fix double caching ## What changes were proposed in this pull request? `df.rdd.getStorageLevel` => `df.storageLevel` using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in

spark git commit: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b9b54b1c8 -> c5f9b89dd [SPARK-18608][ML] Fix double caching ## What changes were proposed in this pull request? `df.rdd.getStorageLevel` => `df.storageLevel` using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in

spark git commit: [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark

2017-09-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 515910e9b -> 720c94fe7 [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark # What changes were proposed in this pull request? Added tunable parallelism to the pyspark implementation of one

spark git commit: [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-09-01 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master aba9492d2 -> 900f14f6f [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns ## What changes were proposed in this pull request? Add test for prediction using the model with all combinations

spark git commit: [SPARK-21862][ML] Add overflow check in PCA

2017-08-31 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 96028e36b -> f5e10a34e [SPARK-21862][ML] Add overflow check in PCA ## What changes were proposed in this pull request? add overflow check in PCA, otherwise it is possible to throw `NegativeArraySizeException` when `k` and `numFeatures`

spark git commit: [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary

2017-08-31 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master cba69aeb4 -> 96028e36b [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary ## What changes were proposed in this pull request? add an "asBinary" method to LogisticRegressionSummary

spark git commit: [MINOR][ML] Document treatment of instance weights in logreg summary

2017-08-29 Thread jkbradley
ion summary traits. Author: Joseph K. Bradley <jos...@databricks.com> Closes #19071 from jkbradley/lr-summary-minor. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/840ba053 Tree: http://git-wip-us.apache.org/repos/asf/s

spark git commit: [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression

2017-08-28 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 73e64f7d5 -> c7270a46f [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression ## What changes were proposed in this pull request? Add 4 traits, using the following hierarchy: LogisticRegressionSummary

spark git commit: [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero (backport PR for 2.2)

2017-08-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 a58536741 -> 2b4bd7910 [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero (backport PR for 2.2) ## What changes were proposed in this pull request? This is backport PR of

spark git commit: [SPARK-12664][ML] Expose probability in mlp model

2017-08-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master d58a3507e -> d6b30edd4 [SPARK-12664][ML] Expose probability in mlp model ## What changes were proposed in this pull request? Modify MLP model to inherit `ProbabilisticClassificationModel` and so that it can expose the probability column

spark git commit: [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero

2017-08-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 01a8e4627 -> d56c26210 [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero ## What changes were proposed in this pull request? fix bug of MLOR do not work correctly when featureStd contains zero We can

spark git commit: [SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages

2017-08-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b0bdfce9c -> 35db3b9fe [SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages ## What changes were proposed in this pull request? Implemented a Python-only persistence framework for pipelines containing stages that

spark git commit: [SPARK-21542][ML][PYTHON] Python persistence helper functions

2017-08-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master baf5cac0f -> fdcee028a [SPARK-21542][ML][PYTHON] Python persistence helper functions ## What changes were proposed in this pull request? Added DefaultParamsWriteable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to

spark git commit: [SPARK-21633][ML][PYTHON] UnaryTransformer in Python

2017-08-04 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 25826c77d -> 1347b2a69 [SPARK-21633][ML][PYTHON] UnaryTransformer in Python ## What changes were proposed in this pull request? Implemented UnaryTransformer in Python. ## How was this patch tested? This patch was tested by creating a

spark git commit: [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest

2017-07-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4ce735eed -> 7047f49f4 [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest ## What changes were proposed in this pull request? Added functionality for CrossValidator and

spark git commit: [SPARK-20929][ML] LinearSVC should use its own threshold param

2017-06-20 Thread jkbradley
ies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley <jos...@databricks.com> Closes #18151 from jkbradley/ml-2.2-

spark git commit: [SPARK-20929][ML] LinearSVC should use its own threshold param

2017-06-20 Thread jkbradley
ies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley <jos...@databricks.com> Closes #18151 from jkbradley/ml-2.2-

spark git commit: [SPARK-21050][ML] Word2vec persistence overflow bug fix

2017-06-12 Thread jkbradley
es #18265 from jkbradley/word2vec-save-fix. (cherry picked from commit ff318c0d2f283c3f46491f229f82d93714da40c7) Signed-off-by: Joseph K. Bradley <jos...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/48a8

spark git commit: [SPARK-21050][ML] Word2vec persistence overflow bug fix

2017-06-12 Thread jkbradley
8265 from jkbradley/word2vec-save-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ff318c0d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ff318c0d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ff318c0d Bra
