spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

2018-09-06 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.4 b632e775c -> 085f731ad [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception ## What changes were proposed in this pull request? mapValues in Scala is currently not serializable. To avoid the serializa

spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

2018-09-06 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 7ef6d1daf -> 3b6591b0b [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception ## What changes were proposed in this pull request? mapValues in Scala is currently not serializable. To avoid the serialization

spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3

2018-08-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 42c1fdd22 -> f5983823e [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3 ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. A

spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values

2018-08-23 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 8ed044928 -> b5e118808 [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. Add return. ## How
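
The missing `return` statements matter because a setter that returns nothing evaluates to `None`, which breaks method chaining, and a getter that returns nothing silently loses the value. A minimal plain-Python sketch of that bug pattern follows. `SizeHintSketch` and its methods are hypothetical stand-ins for illustration, not the actual `VectorSizeHint` code in feature.py.

```python
class SizeHintSketch:
    """Hypothetical stand-in for a Params-style class; this is not pyspark code."""

    def __init__(self):
        self._size = None

    def set_size_buggy(self, value):
        # Bug pattern: the value is stored but nothing is returned,
        # so the call expression evaluates to None.
        self._size = value

    def get_size_buggy(self):
        # Bug pattern: the value is looked up and then dropped.
        self._size

    def set_size(self, value):
        # Fixed: return self so that calls can be chained.
        self._size = value
        return self

    def get_size(self):
        # Fixed: return the stored value.
        return self._size


hint = SizeHintSketch()
buggy_set_result = hint.set_size_buggy(3)   # None, so chaining would fail
buggy_get_result = hint.get_size_buggy()    # None, although a size was stored
fixed_size = SizeHintSketch().set_size(3).get_size()
```

With the fixed methods, the chained call yields the stored size. The buggy variants both evaluate to `None`.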

spark git commit: [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds

2018-08-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 99d2e4e00 -> 72ecfd095 [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds ## What changes were proposed in this pull request? runParallelPersonalizedPageRank in graphx checks that `sources` are <= I

spark git commit: [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented.

2018-07-20 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 244bcff19 -> 3cb1b5780 [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented. ## What changes were proposed in this pull request? Followup for #21719. Update spark.ml training code to fully wrap instrumented methods and rem

spark git commit: [SPARK-24747][ML] Make Instrumentation class more flexible

2018-07-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 7688ce88b -> 912634b00 [SPARK-24747][ML] Make Instrumentation class more flexible ## What changes were proposed in this pull request? This PR updates the Instrumentation class to make it more flexible and a little bit easier to use. When

spark git commit: [SPARK-7132][ML] Add fit with validation set to spark.ml GBT

2018-05-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master a33dcf4a0 -> ffaefe755 [SPARK-7132][ML] Add fit with validation set to spark.ml GBT ## What changes were proposed in this pull request? Add fit with validation set to spark.ml GBT ## How was this patch tested? Will add later. Author: We

spark git commit: [SPARK-24114] Add instrumentation to FPGrowth.

2018-05-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master a7a9b1837 -> 439c69511 [SPARK-24114] Add instrumentation to FPGrowth. ## What changes were proposed in this pull request? Have FPGrowth keep track of model training using the Instrumentation class. ## How was this patch tested? manually

spark git commit: [SPARK-22210][ML] Add seed for LDA variationalTopicInference

2018-05-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 991726f31 -> bfd75cdfb [SPARK-22210][ML] Add seed for LDA variationalTopicInference ## What changes were proposed in this pull request? - Add seed parameter for variationalTopicInference - Add seed for calling variationalTopicInference in

spark git commit: [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API

2018-05-15 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 6b94420f6 -> 8a13c5096 [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API ## What changes were proposed in this pull request? See SPARK-23455 for reference. Now default params in ML are saved separately

spark git commit: [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-05-09 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 628c7b517 -> 7aaa148f5 [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs ## What changes were proposed in this pull request? Provide evaluateEachIteration method or equivalent for spark.ml GBTs. ## Ho

spark git commit: [MINOR][ML][DOC] Improved Naive Bayes user guide explanation

2018-05-09 Thread jkbradley
user guide page. I also improved the wording and organization slightly. ## How was this patch tested? Built docs locally. Author: Joseph K. Bradley Closes #21272 from jkbradley/nb-doc-update. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/

spark git commit: [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan

2018-05-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master f48bd6bdc -> 76ecd0950 [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan ## What changes were proposed in this pull request? PrefixSpan API for spark.ml. New implementation instead of #20810 ## How was this patc

spark git commit: [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning

2018-05-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 1c9c5de95 -> f48bd6bdc [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning ## What changes were proposed in this pull request? ML test for StructuredStreaming: spark.ml.tuning ## How was this patch tested? N/A Autho

spark git commit: [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark

2018-05-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master d83e96372 -> 56a52e0a5 [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark ## What changes were proposed in this pull request? Change FPGrowth from private to private[spark]. If no numParti

spark git commit: [SPARK-23990][ML] Instruments logging improvements - ML regression package

2018-04-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 83013752e -> 379bffa05 [SPARK-23990][ML] Instruments logging improvements - ML regression package ## What changes were proposed in this pull request? Instruments logging improvements - ML regression package I add an `OptionalInstrument` c

spark git commit: [SPARK-23455][ML] Default Params in ML should be saved separately in metadata

2018-04-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master ce7ba2e98 -> 83013752e [SPARK-23455][ML] Default Params in ML should be saved separately in metadata ## What changes were proposed in this pull request? We save ML's user-supplied params and default params as one entity in metadata. Durin

spark git commit: [SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features

2018-04-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 55c4ca88a -> 2a24c481d [SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features ## What changes were proposed in this pull request? - Multiple possible input types are added in validateAndTransformSchema() and computeC

spark git commit: [SPARK-24026][ML] Add Power Iteration Clustering to spark.ml

2018-04-19 Thread jkbradley
y author is wangmiao1981 ## How was this patch tested? This PR has 2 types of tests: * Copies of tests from spark.mllib's PIC tests * New tests specific to the spark.ml APIs Author: wm...@hotmail.com Author: wangmiao1981 Author: Joseph K. Bradley Closes #21090 from jkbradley/wan

spark git commit: [SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer

2018-04-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master f39e82ce1 -> 1ca3c50fe [SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer ## What changes were proposed in this pull request? Python API for DataFrame-based multivariate summarizer. ## How was this patch te

spark git commit: [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API

2018-04-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 5003736ad -> 04614820e [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API ## What changes were proposed in this pull request? Add python API for collecting sub-models during CrossVa

spark git commit: [SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for OneVsRestModel

2018-04-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 083cf2235 -> 5003736ad [SPARK-9312][ML] Add RawPrediction, numClasses, and numFeatures for OneVsRestModel add RawPrediction as output column add numClasses and numFeatures to OneVsRestModel ## What changes were proposed in this pull reque

spark git commit: [SPARK-23751][FOLLOW-UP] fix build for scala-2.12

2018-04-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 0b19122d4 -> 0f93b91a7 [SPARK-23751][FOLLOW-UP] fix build for scala-2.12 ## What changes were proposed in this pull request? fix build for scala-2.12 ## How was this patch tested? Manual. Author: WeichenXu Closes #21051 from WeichenXu

spark git commit: typo rawPredicition changed to rawPrediction

2018-04-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 75a183071 -> 9d960de08 typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch

spark git commit: typo rawPredicition changed to rawPrediction

2018-04-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 acfc156df -> 03a4dfd69 typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this pa

spark git commit: [SPARK-22883][ML] ML test for StructuredStreaming: spark.ml.feature, I-M

2018-04-11 Thread jkbradley
Adds structured streaming tests using testTransformer for these suites: * IDF * Imputer * Interaction * MaxAbsScaler * MinHashLSH * MinMaxScaler * NGram ## How was this patch tested? It is a bunch of tests! Author: Joseph K. Bradley Author: Joseph K. Bradley Closes #21042 from jkbradley/SPARK-22883-pa

spark git commit: [SPARK-22883] ML test for StructuredStreaming: spark.ml.feature, I-M

2018-04-11 Thread jkbradley
ter * Interaction * MaxAbsScaler * MinHashLSH * MinMaxScaler * NGram ## How was this patch tested? It is a bunch of tests! Author: Joseph K. Bradley Closes #20964 from jkbradley/SPARK-22883-part2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/

spark git commit: [SPARK-23944][ML] Add the set method for the two LSHModel

2018-04-10 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4f1e8b9bb -> 7c7570d46 [SPARK-23944][ML] Add the set method for the two LSHModel ## What changes were proposed in this pull request? Add two set methods for LSHModel in LSH.scala, BucketedRandomProjectionLSH.scala, and MinHashLSH.scala ##

spark git commit: [SPARK-23871][ML][PYTHON] add python api for VectorAssembler handleInvalid

2018-04-10 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master adb222b95 -> 4f1e8b9bb [SPARK-23871][ML][PYTHON] add python api for VectorAssembler handleInvalid ## What changes were proposed in this pull request? add python api for VectorAssembler handleInvalid ## How was this patch tested? Add doct

spark git commit: [SPARK-23751][ML][PYSPARK] Kolmogorov-Smirnoff test Python API in pyspark.ml

2018-04-10 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master e17965891 -> adb222b95 [SPARK-23751][ML][PYSPARK] Kolmogorov-Smirnoff test Python API in pyspark.ml ## What changes were proposed in this pull request? Kolmogorov-Smirnoff test Python API in `pyspark.ml` **Note** API with `CDF` is a litt

spark git commit: [SPARK-14681][ML] Provide label/impurity stats for spark.ml decision tree nodes

2018-04-09 Thread jkbradley
ide val rootNode: ClassificationNode class DecisionTreeRegressionModel override val rootNode: RegressionNode ``` Closes #17466 ## How was this patch tested? UT will be added soon. Author: WeichenXu Author: jkbradley Closes #20786 from WeichenXu123/tree_stat_api_2. Project: http://git-

spark git commit: [SPARK-23859][ML] Initial PR for Instrumentation improvements: UUID and logging levels

2018-04-06 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master c926acf71 -> d23a805f9 [SPARK-23859][ML] Initial PR for Instrumentation improvements: UUID and logging levels ## What changes were proposed in this pull request? Initial PR for Instrumentation improvements: UUID and logging levels. This P

spark git commit: [SPARK-23870][ML] Forward RFormula handleInvalid Param to VectorAssembler to handle invalid values in non-string columns

2018-04-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4807d381b -> f2ac08795 [SPARK-23870][ML] Forward RFormula handleInvalid Param to VectorAssembler to handle invalid values in non-string columns ## What changes were proposed in this pull request? `handleInvalid` Param was forwarded to the

spark git commit: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-04-02 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 28ea4e314 -> a1351828d [SPARK-23690][ML] Add handleinvalid to VectorAssembler ## What changes were proposed in this pull request? Introduce `handleInvalid` parameter in `VectorAssembler` that can take in `"keep", "skip", "error"` options.
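
To illustrate what a three-way `handleInvalid` switch could look like, here is a small plain-Python sketch. The truncated message above only names the options `"keep"`, `"skip"`, and `"error"`. The semantics shown here are an assumption modelled on how similar Spark ML parameters are commonly described: keep the row with NaN placeholders, drop the row, or raise. The `assemble` function is hypothetical and is not the `VectorAssembler` implementation.

```python
import math


def assemble(rows, handle_invalid="error"):
    """Concatenate each row's values into one list, treating None as invalid.

    Hypothetical helper for illustration only; the option semantics are assumed.
    """
    if handle_invalid not in ("keep", "skip", "error"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")
    assembled = []
    for row in rows:
        if any(v is None for v in row):
            if handle_invalid == "error":
                raise ValueError(f"invalid (None) value in row {row!r}")
            if handle_invalid == "skip":
                continue  # drop the whole row
            # "keep": retain the row and mark invalid entries as NaN
            row = [math.nan if v is None else v for v in row]
        assembled.append([float(v) for v in row])
    return assembled


rows = [(1.0, 2.0), (None, 4.0)]
kept = assemble(rows, "keep")      # both rows survive; the None becomes NaN
skipped = assemble(rows, "skip")   # only the fully valid row survives
```

Calling `assemble(rows, "error")` on the same input raises `ValueError`, which is the strictest of the three behaviours in this sketch.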

spark git commit: [MINOR] Fix Java lint from new JavaKolmogorovSmirnovTestSuite

2018-03-21 Thread jkbradley
of JavaKolmogorovSmirnovTestSuite Author: Joseph K. Bradley Closes #20875 from jkbradley/kstest-lint-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a091ee67 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a091ee67 Diff: http://git-

spark git commit: [SPARK-10884][ML] Support prediction on single instance for regression and classification related models

2018-03-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 500b21c3d -> bf09f2f71 [SPARK-10884][ML] Support prediction on single instance for regression and classification related models ## What changes were proposed in this pull request? Support prediction on single instance for regression and c

spark git commit: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTest in MLlib

2018-03-20 Thread jkbradley
for `KolmogorovSmirnovTest` in `mllib.stat`. ## How was this patch tested? Test suite added. Author: WeichenXu Author: jkbradley Closes #19108 from WeichenXu123/ml-ks-test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f5e8

spark git commit: [SPARK-23728][BRANCH-2.3] Fix ML tests with expected exceptions running streaming tests

2018-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 80e79430f -> 920493949 [SPARK-23728][BRANCH-2.3] Fix ML tests with expected exceptions running streaming tests ## What changes were proposed in this pull request? The testTransformerByInterceptingException failed to catch the expected

[1/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 1098933b0 -> 279b3db89 http://git-wip-us.apache.org/repos/asf/spark/blob/279b3db8/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff --git a/

[2/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
[SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z # What changes were proposed in this pull request? Adds structured streaming tests using testTransformer for these suites: - NGramSuite - NormalizerSuite - OneHotEncoderEstimatorSuite - OneHotEncoderSuite - PCASuite - Polynom

[1/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 f3efbfa4b -> 0663b6119 http://git-wip-us.apache.org/repos/asf/spark/blob/0663b611/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff --git

[2/2] spark git commit: [SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z

2018-03-14 Thread jkbradley
[SPARK-22915][MLLIB] Streaming tests for spark.ml.feature, from N to Z # What changes were proposed in this pull request? Adds structured streaming tests using testTransformer for these suites: - NGramSuite - NormalizerSuite - OneHotEncoderEstimatorSuite - OneHotEncoderSuite - PCASuite - Polynom

spark git commit: [SPARK-18630][PYTHON][ML] Move del method from JavaParams to JavaWrapper; add tests

2018-03-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 508573958 -> 7706eea6a [SPARK-18630][PYTHON][ML] Move del method from JavaParams to JavaWrapper; add tests The `__del__` method that explicitly detaches the object was moved from `JavaParams` to `JavaWrapper` class, this way model summari

spark git commit: [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification

2018-03-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4586eada4 -> 98a5c0a35 [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in spark.ml.classificati

spark git commit: [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification

2018-03-05 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 232b9f81f -> 4550673b1 [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in spark.ml.classifi

spark git commit: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.feature, from A to H

2018-03-01 Thread jkbradley
Closes #20111 from jkbradley/SPARK-22883-streaming-featureAM. (cherry picked from commit 119f6a0e4729aa952e811d2047790a32ee90bf69) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56cfbd93 Tree: h

spark git commit: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.feature, from A to H

2018-03-01 Thread jkbradley
111 from jkbradley/SPARK-22883-streaming-featureAM. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/119f6a0e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/119f6a0e Diff: http://git-wip-us.apache.org/repos/asf/spark/d

spark git commit: [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN - for branch-2.2

2018-02-21 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 a95c3e29d -> 1cc34f3e5 [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN - for branch-2.2 ## What changes were proposed in this pull request? for branch-2.2 only drops the rows containing NaN in the input colu

spark git commit: [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug

2018-02-15 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 03960faa6 -> 0bd7765cd [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug ## What changes were proposed in this pull request? Problem: Since 2.3, `Bucketizer` supports multiple input/output columns. We will

spark git commit: [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug

2018-02-15 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 6968c3cfd -> db45daab9 [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug ## What changes were proposed in this pull request? Problem: Since 2.3, `Bucketizer` supports multiple input/output columns. We will chec

spark git commit: [SPARK-23154][ML][DOC] Document backwards compatibility guarantees for ML persistence

2018-02-13 Thread jkbradley
ML models and Pipelines from old Spark versions. Discussed & confirmed on linked JIRA. Author: Joseph K. Bradley Closes #20592 from jkbradley/SPARK-23154-backwards-compat-doc. (cherry picked from commit d58fe28836639e68e262812d911f167cb071007b) Signed-off-by: Joseph K. Bradley Projec

spark git commit: [SPARK-23154][ML][DOC] Document backwards compatibility guarantees for ML persistence

2018-02-13 Thread jkbradley
ML models and Pipelines from old Spark versions. Discussed & confirmed on linked JIRA. Author: Joseph K. Bradley Closes #20592 from jkbradley/SPARK-23154-backwards-compat-doc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark

spark git commit: [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator.

2018-01-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 863ffdc8a -> 833a584bb [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator. ## What changes were proposed in this pull request? RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid u

spark git commit: [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator.

2018-01-16 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 12db365b4 -> 4371466b3 [SPARK-23045][ML][SPARKR] Update RFormula to use OneHotEncoderEstimator. ## What changes were proposed in this pull request? RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using

spark git commit: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 6bb22961c -> 55695c712 [SPARK-23008][ML] OnehotEncoderEstimator python API ## What changes were proposed in this pull request? OnehotEncoderEstimator python API. ## How was this patch tested? doctest Author: WeichenXu Closes #2020

spark git commit: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 186bf8fb2 -> b5042d75c [SPARK-23008][ML] OnehotEncoderEstimator python API ## What changes were proposed in this pull request? OnehotEncoderEstimator python API. ## How was this patch tested? doctest Author: WeichenXu Closes #20209 fr

spark git commit: [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 f891ee324 -> 2ec302658 [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline ## What changes were proposed in this pull request? Including VectorSizeHint in RFormula pipelines will allow them to be applied to str

spark git commit: [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline

2018-01-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 6f7aaed80 -> 186bf8fb2 [SPARK-23046][ML][SPARKR] Have RFormula include VectorSizeHint in pipeline ## What changes were proposed in this pull request? Including VectorSizeHint in RFormula pipelines will allow them to be applied to streami

spark git commit: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator

2018-01-05 Thread jkbradley
zed the logic to show what I meant in the comment in the previous PR. I think it's simpler but am open to suggestions. I also made some small style cleanups based on IntelliJ warnings. ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley Closes #20132 from j

spark git commit: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator

2018-01-05 Thread jkbradley
the logic to show what I meant in the comment in the previous PR. I think it's simpler but am open to suggestions. I also made some small style cleanups based on IntelliJ warnings. ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley Closes #20132 from jkbradle

spark git commit: [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit

2018-01-04 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.3 145820bda -> 5b524cc0c [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit ## What changes were proposed in this pull request? Avoid holding all models in memory for `TrainVal

spark git commit: [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit

2018-01-04 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 52fc5c17d -> cf0aa6557 [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed memory tradeoff for TrainValidationSplit ## What changes were proposed in this pull request? Avoid holding all models in memory for `TrainValidat

spark git commit: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator

2017-12-31 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 5955a2d0f -> 994065d89 [SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator ## What changes were proposed in this pull request? This patch adds a new class `OneHotEncoderEstimator` which extends `Estimator`. The

spark git commit: [SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test

2017-12-29 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 816963043 -> 2ea17afb6 [SPARK-22881][ML][TEST] ML regression package testsuite add StructuredStreaming test ## What changes were proposed in this pull request? ML regression package testsuite add StructuredStreaming test In order to make

spark git commit: [SPARK-22734][ML][PYSPARK] Added Python API for VectorSizeHint.

2017-12-29 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 30fcdc038 -> 816963043 [SPARK-22734][ML][PYSPARK] Added Python API for VectorSizeHint. (Please fill in changes proposed in this fix) Python API for VectorSizeHint Transformer. (Please explain how this patch was tested. E.g. unit tests, in

spark git commit: [SPARK-22922][ML][PYSPARK] Pyspark portion of the fit-multiple API

2017-12-29 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master ccda75b0d -> 30fcdc038 [SPARK-22922][ML][PYSPARK] Pyspark portion of the fit-multiple API ## What changes were proposed in this pull request? Adding fitMultiple API to `Estimator` with default implementation. Also update have ml.tuning me

spark git commit: [SPARK-22905][ML][FOLLOWUP] Fix GaussianMixtureModel save

2017-12-29 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4e9e6aee4 -> afc364146 [SPARK-22905][ML][FOLLOWUP] Fix GaussianMixtureModel save ## What changes were proposed in this pull request? make sure model data is stored in order. WeichenXu123 ## How was this patch tested? existing tests Autho

spark git commit: [SPARK-22905][MLLIB] Fix ChiSqSelectorModel save implementation

2017-12-28 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master ffe6fd77a -> c74573084 [SPARK-22905][MLLIB] Fix ChiSqSelectorModel save implementation ## What changes were proposed in this pull request? Currently, in `ChiSqSelectorModel`, save: ``` spark.createDataFrame(dataArray).repartition(1).write.

spark git commit: [SPARK-22899][ML][STREAMING] Fix OneVsRestModel transform on streaming data failed.

2017-12-27 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 774715d5c -> 753793bc8 [SPARK-22899][ML][STREAMING] Fix OneVsRestModel transform on streaming data failed. ## What changes were proposed in this pull request? Fix OneVsRestModel transform on streaming data failed. ## How was this patch t

spark git commit: [SPARK-22707][ML] Optimize CrossValidator memory occupation by models in fitting

2017-12-24 Thread jkbradley
PR to fix it. ## Discussion I give 3 approaches which we can compare. After discussion, I realized that none of them is ideal, so we have to make a trade-off. **After discussion with jkbradley, chose approach 3** ### Approach 1 ~~The approach proposed by MrBago at~~ https://github.com/apache/spark/p

spark git commit: [SPARK-22346][ML] VectorSizeHint Transformer for using VectorAssembler in StructuredStreaming

2017-12-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 13190a4f6 -> d23dc5b8e [SPARK-22346][ML] VectorSizeHint Transformer for using VectorAssembler in StructuredStreaming ## What changes were proposed in this pull request? A new VectorSizeHint transformer was added. This transformer is meant

spark git commit: [SPARK-22644][ML][TEST] Make ML testsuite support StructuredStreaming test

2017-12-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master c7d014861 -> 0e36ba621 [SPARK-22644][ML][TEST] Make ML testsuite support StructuredStreaming test ## What changes were proposed in this pull request? We need to add some helper code to make testing ML transformers & models easier with str

spark git commit: [SPARK-21866][ML][PYSPARK] Adding spark image reader

2017-11-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 0605ad761 -> 1edb3175d [SPARK-21866][ML][PYSPARK] Adding spark image reader ## What changes were proposed in this pull request? Adding spark image reader, an implementation of schema for representing images in spark DataFrames The code is

spark git commit: [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid

2017-11-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 774398045 -> 1e6f76059 [SPARK-12375][ML] VectorIndexerModel support handle unseen categories via handleInvalid ## What changes were proposed in this pull request? Support skip/error/keep strategy, similar to `StringIndexer`. Implemented v

spark git commit: [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala

2017-11-14 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b00972259 -> 774398045 [SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala ## What changes were proposed in this pull request? We add a parameter controlling whether to collect the full model list when Cross

spark git commit: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark

2017-11-13 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master c8b7f97b8 -> d8741b2b0 [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tuning in PySpark ## What changes were proposed in this pull request? Fix doc issue mentioned here: https://github.com/apache/spark/pull/19122#issuecomment-340111

spark git commit: [SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark

2017-10-27 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b3d8fc3dc -> 20eb95e5e [SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark ## What changes were proposed in this pull request? Add parallelism support for ML tuning in pyspark. ## How was this patch tested? Test

spark git commit: [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic)

2017-10-25 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 9ed64048a -> 35725f735 [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic) ## What changes were proposed in this pull request? Fix NaiveBayes unit test occasionly fail: Set seed f

spark git commit: [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic)

2017-10-25 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b377ef133 -> 841f1d776 [SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test dataset not deterministic) ## What changes were proposed in this pull request? Fix NaiveBayes unit test occasionly fail: Set seed for `

spark git commit: [SPARK-14371][MLLIB] OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-10-18 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 1f25d8683 -> 52facb006 [SPARK-14371][MLLIB] OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver Hi, # What changes were proposed in this pull request? as it was proposed by jkbradley, ```gammat``` are

spark git commit: [SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug

2017-09-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 3e6a714c9 -> f180b6534 [SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug ## What changes were proposed in this pull request? Currently the param of CrossValidator/TrainValidationSplit persist/loading is hard

spark git commit: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 63098dc31 -> b606dc177 [SPARK-18608][ML] Fix double caching ## What changes were proposed in this pull request? `df.rdd.getStorageLevel` => `df.storageLevel` using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getS

spark git commit: [SPARK-18608][ML] Fix double caching

2017-09-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b9b54b1c8 -> c5f9b89dd [SPARK-18608][ML] Fix double caching ## What changes were proposed in this pull request? `df.rdd.getStorageLevel` => `df.storageLevel` using cmd `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStora

spark git commit: [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark

2017-09-12 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 515910e9b -> 720c94fe7 [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark # What changes were proposed in this pull request? Added tunable parallelism to the pyspark implementation of one v

spark git commit: [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-09-01 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master aba9492d2 -> 900f14f6f [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns ## What changes were proposed in this pull request? Add test for prediction using the model with all combinations

spark git commit: [SPARK-21862][ML] Add overflow check in PCA

2017-08-31 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 96028e36b -> f5e10a34e [SPARK-21862][ML] Add overflow check in PCA ## What changes were proposed in this pull request? add overflow check in PCA, otherwise it is possible to throw `NegativeArraySizeException` when `k` and `numFeatures` ar

spark git commit: [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary

2017-08-31 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master cba69aeb4 -> 96028e36b [SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary ## What changes were proposed in this pull request? add an "asBinary" method to LogisticRegressionSummary

spark git commit: [MINOR][ML] Document treatment of instance weights in logreg summary

2017-08-29 Thread jkbradley
ion summary traits. Author: Joseph K. Bradley Closes #19071 from jkbradley/lr-summary-minor. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/840ba053 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/840ba053 Diff: h

spark git commit: [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression

2017-08-28 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 73e64f7d5 -> c7270a46f [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression ## What changes were proposed in this pull request? Add 4 traits, using the following hierarchy: LogisticRegressionSummary LogisticRegressionTrain

spark git commit: [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero (backport PR for 2.2)

2017-08-24 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-2.2 a58536741 -> 2b4bd7910 [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero (backport PR for 2.2) ## What changes were proposed in this pull request? This is backport PR of https://github.com/apache/sp

spark git commit: [SPARK-12664][ML] Expose probability in mlp model

2017-08-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master d58a3507e -> d6b30edd4 [SPARK-12664][ML] Expose probability in mlp model ## What changes were proposed in this pull request? Modify MLP model to inherit `ProbabilisticClassificationModel` and so that it can expose the probability column

spark git commit: [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero

2017-08-22 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 01a8e4627 -> d56c26210 [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero ## What changes were proposed in this pull request? fix bug of MLOR do not work correctly when featureStd contains zero We can r

spark git commit: [SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages

2017-08-11 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b0bdfce9c -> 35db3b9fe [SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages ## What changes were proposed in this pull request? Implemented a Python-only persistence framework for pipelines containing stages that ca

spark git commit: [SPARK-21542][ML][PYTHON] Python persistence helper functions

2017-08-07 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master baf5cac0f -> fdcee028a [SPARK-21542][ML][PYTHON] Python persistence helper functions ## What changes were proposed in this pull request? Added DefaultParamsWriteable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to

spark git commit: [SPARK-21633][ML][PYTHON] UnaryTransformer in Python

2017-08-04 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 25826c77d -> 1347b2a69 [SPARK-21633][ML][PYTHON] UnaryTransformer in Python ## What changes were proposed in this pull request? Implemented UnaryTransformer in Python. ## How was this patch tested? This patch was tested by creating a Moc

spark git commit: [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest

2017-07-17 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4ce735eed -> 7047f49f4 [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest ## What changes were proposed in this pull request? Added functionality for CrossValidator and TrainValidationSpli

spark git commit: [SPARK-20929][ML] LinearSVC should use its own threshold param

2017-06-19 Thread jkbradley
to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup. Project: h

spark git commit: [SPARK-20929][ML] LinearSVC should use its own threshold param

2017-06-19 Thread jkbradley
to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup. (che

spark git commit: [SPARK-21050][ML] Word2vec persistence overflow bug fix

2017-06-12 Thread jkbradley
ery easily to have an overflow in calculating the number of partitions for ML persistence. This modifies the calculations to use Long. ## How was this patch tested? New unit test. I verified that the test fails before this patch. Author: Joseph K. Bradley Closes #18265 from jkbradley/word2
