[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20126 Hm, I see. Will open a followup PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20126 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85560/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20126 **[Test build #85560 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85560/testReport)** for PR 20126 at commit [`2a9dff4`](https://github.com/apache/spark/commit/2a9dff449c9dc4a18e9a5d7f042450760bb9af2d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20126 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20126 **[Test build #85560 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85560/testReport)** for PR 20126 at commit [`2a9dff4`](https://github.com/apache/spark/commit/2a9dff449c9dc4a18e9a5d7f042450760bb9af2d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20069: [SPARK-22895] [SQL] Push down the deterministic p...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20069#discussion_r159141412

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -851,7 +851,7 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
     case filter @ Filter(condition, union: Union) =>
       // Union could change the rows, so non-deterministic predicate can't be pushed down
-      val (pushDown, stayUp) = splitConjunctivePredicates(condition).span(_.deterministic)
+      val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition(_.deterministic)
--- End diff --

What does "after the first non-deterministic" mean? Doesn't this simply partition the predicates into deterministic and non-deterministic ones? Does it take the "first" non-deterministic predicate into account?
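For readers following the `span` versus `partition` question, here is a minimal sketch of how the two standard Scala collection methods differ. `Pred` and the sample predicates are hypothetical stand-ins, not Catalyst classes; only the `deterministic` flag matters for the illustration.

```scala
// Hypothetical stand-in for a Catalyst expression (assumption, not Spark's API).
final case class Pred(name: String, deterministic: Boolean)

object SpanVsPartition {
  val conjuncts: Seq[Pred] = Seq(
    Pred("a > 1", deterministic = true),
    Pred("rand() > 0.5", deterministic = false),
    Pred("b < 10", deterministic = true)
  )

  // span takes the longest prefix that satisfies the test and stops at the
  // first element that fails it, so "b < 10" ends up in the second half.
  val (spanPushDown, spanStayUp) = conjuncts.span(_.deterministic)

  // partition tests every element independently, regardless of position.
  val (partPushDown, partStayUp) = conjuncts.partition(_.deterministic)

  def main(args: Array[String]): Unit = {
    println(spanPushDown.map(_.name)) // List(a > 1)
    println(partPushDown.map(_.name)) // List(a > 1, b < 10)
  }
}
```

With `span`, anything after the first non-deterministic predicate stays up even if it is deterministic itself; with `partition`, position plays no role. Which behaviour is appropriate for the rule is the point under review here.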
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20126 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20126 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85559/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20126 **[Test build #85559 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85559/testReport)** for PR 20126 at commit [`23cc79b`](https://github.com/apache/spark/commit/23cc79b0cecb33555e8f6374cc03eddccce86445). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20126 **[Test build #85559 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85559/testReport)** for PR 20126 at commit [`23cc79b`](https://github.com/apache/spark/commit/23cc79b0cecb33555e8f6374cc03eddccce86445). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20125 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20125 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/8/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20125 **[Test build #8 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/8/testReport)** for PR 20125 at commit [`5cae64b`](https://github.com/apache/spark/commit/5cae64b0da57a3f45b54bcc39c18463d3945a934). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20125 I actually think the points in https://github.com/apache/spark/pull/20125#issuecomment-354604768 are good ones, and I was hesitant about this myself. IMHO it's fine, but let me cc @hvanhovell and @rxin too, who reviewed my related PRs before.
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20125

> Btw, is this any difference than using string? Like:

Nope, they will be the same, but I was thinking this is the simplest fix.
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20125

Yup, I was thinking of a SparkSQL-only feature. For more details, the original intention was to support multiple values for `nullValue`, but I realised such option support can be generalised. There were several issues about this since CSV is a third-party library (I will find and give some links if requested). Also, there is one reference in R too:

```R
> d <- "col1,col2
+ 1,3
+ 2,4"
> df <- read.csv(text=d, na.strings=c("3", "2"))
> df
```

```
  col1 col2
1    1   NA
2   NA    4
```

For more context, the original proposal (Scala/SQL/Python/Java) at https://github.com/apache/spark/pull/16611 touched many files, and I received advice to make this smaller, which I liked.
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20125 Is this a special feature for SparkSQL only? Seems Hive doesn't have such support. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20126 **[Test build #85558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85558/testReport)** for PR 20126 at commit [`85639dd`](https://github.com/apache/spark/commit/85639dd220e8fcb0489febc0414b51d22c0e41a9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20126 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20126 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85556/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20126 **[Test build #85556 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85556/testReport)** for PR 20126 at commit [`ada1c4c`](https://github.com/apache/spark/commit/ada1c4c7e1c175ee821c0ac191fc1decc3701f68). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20127: [SPARK-22932] [SQL] Refactor AnalysisContext
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20127 **[Test build #85557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85557/testReport)** for PR 20127 at commit [`f158a95`](https://github.com/apache/spark/commit/f158a951b779e56e06d2c73234bac5c79055b2f5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20127: [SPARK-22932] [SQL] Refactor AnalysisContext
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20127 cc @cloud-fan @jiangxb1987 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20127: [SPARK-22932] [SQL] Refactor AnalysisContext
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/20127

[SPARK-22932] [SQL] Refactor AnalysisContext

## What changes were proposed in this pull request?

Add a `reset` function to ensure the state in `AnalysisContext` is per-query.

## How was this patch tested?

The existing test cases

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark refactorAnalysisContext

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20127.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20127

commit f158a951b779e56e06d2c73234bac5c79055b2f5
Author: gatorsmile
Date: 2017-12-31T13:21:13Z

    refactor
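As an illustration of what "per-query state with a `reset`" can look like, here is a minimal sketch built on a thread-local holder. `QueryContext`, its fields, and `withFreshContext` are hypothetical names chosen for this example; they are not taken from the patch and do not claim to mirror Spark's actual `AnalysisContext`.

```scala
// Hypothetical per-query context; field names are illustrative assumptions.
final case class QueryContext(
    defaultDatabase: Option[String] = None,
    nestedViewDepth: Int = 0)

object QueryContext {
  // One context per thread, created lazily with default values.
  private val current = new ThreadLocal[QueryContext] {
    override def initialValue(): QueryContext = QueryContext()
  }

  def get: QueryContext = current.get()
  def set(ctx: QueryContext): Unit = current.set(ctx)

  // Drop whatever a previous query left behind on this thread.
  def reset(): Unit = current.remove()

  // Run `body` for one query and always clear the state afterwards.
  def withFreshContext[A](body: => A): A = {
    reset()
    try body finally reset()
  }
}
```

Calling `reset()` at the start and end of each query means a later query on the same thread starts from the default context rather than from leftovers.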
[GitHub] spark issue #20126: [DO-NOT-MERGE] Investigate if changes in flume.py actual...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20126 **[Test build #85556 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85556/testReport)** for PR 20126 at commit [`ada1c4c`](https://github.com/apache/spark/commit/ada1c4c7e1c175ee821c0ac191fc1decc3701f68). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20126: [DO-NOT-MERGE] Investigate if changes in flume.py...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/20126

[DO-NOT-MERGE] Investigate if changes in flume.py actually triggeres related tests

## What changes were proposed in this pull request?

Do not merge this. It seems that changes in `flume.py` do not actually trigger the related tests. This is easy to check in the Jenkins environment.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark investigate-streaming

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20126.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20126

commit ada1c4c7e1c175ee821c0ac191fc1decc3701f68
Author: hyukjinkwon
Date: 2017-12-31T13:23:53Z

    Investigate if changes in flume.py actually triggeres related tests
[GitHub] spark pull request #19991: [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19991 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19991: [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to treat ...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/19991 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19715 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/19715 Merged to master. If there are any further small comments / clean ups we can do that during QA for 2.3 Thanks @huaxingao and all others for review! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85554/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85554/testReport)** for PR 20114 at commit [`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20125 **[Test build #8 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/8/testReport)** for PR 20125 at commit [`5cae64b`](https://github.com/apache/spark/commit/5cae64b0da57a3f45b54bcc39c18463d3945a934). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20125 cc @gatorsmile could you take a look please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20125: [SPARK-17967][SQL] Support for array as an option...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/20125

[SPARK-17967][SQL] Support for array as an option in SQL parser

## What changes were proposed in this pull request?

This PR aims to add the ability to handle an array (JSON array) in the `tablePropertyValue` rule.

**SQL**

```sql
CREATE TEMPORARY TABLE tableA USING csv OPTIONS (nullValue [2012, 1.1, 'null'], ...)
```

## How was this patch tested?

Manually tested, and test cases were added in `DDLParserSuite.scala`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17967-sql

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20125.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20125

commit 5cae64b0da57a3f45b54bcc39c18463d3945a934
Author: hyukjinkwon
Date: 2017-12-31T10:27:00Z

    Support for array as an option in SQL
[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20076#discussion_r159136922

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CompressionCodecPrecedenceSuite.scala ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
--- End diff --

Should we move this to `org.apache.spark.sql.execution.datasources.parquet`?
[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20124 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20124 **[Test build #85553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85553/testReport)** for PR 20124 at commit [`53521ca`](https://github.com/apache/spark/commit/53521cac9d39bf9682d67d94d46adde357db1b43). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20124 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85553/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85554/testReport)** for PR 20114 at commit [`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20124

This basically works by splitting the array of ParamMaps into two: one holding the params that can be optimized by the estimator, and one holding the params that can be parallelized over. These are then grouped together so that the estimator can fit a sequence of Models. This allows us to reuse the previous API for fitting multiple Models and still keep the parallelization logic pretty straightforward. Model-specific optimization support works just as it did before any parallelism was introduced. I can explain in further detail or make a design document if needed.

cc @MLnick @WeichenXu123 @jkbradley
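The grouping idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Params` is a plain `Map` standing in for a `ParamMap`, and `groupByNonOptimized` is a hypothetical helper, not the `ParamGridBuilder` API added by the PR.

```scala
object ParamGrouping {
  // Hypothetical stand-in for a ParamMap: param name -> value (assumption).
  type Params = Map[String, Any]

  // Put param maps that differ only in the `optimized` params into the same
  // group. Each group can then be handed to the estimator as one sequence,
  // while separate groups can be fitted in parallel.
  def groupByNonOptimized(grid: Seq[Params], optimized: Set[String]): Seq[Seq[Params]] =
    grid
      .groupBy(_.filterNot { case (name, _) => optimized.contains(name) })
      .values
      .toSeq
}
```

For a grid over `regParam` in {0.1, 0.01} and `maxIter` in {5, 10}, with `maxIter` marked as optimized by the estimator, this yields two groups (one per `regParam` value), each containing both `maxIter` settings. The order of the groups is not guaranteed.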
[GitHub] spark issue #20124: [WIP][SPARK-22126][ML] Fix model-specific optimization s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20124 **[Test build #85553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85553/testReport)** for PR 20124 at commit [`53521ca`](https://github.com/apache/spark/commit/53521cac9d39bf9682d67d94d46adde357db1b43). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20124: [WIP][SPARK-22126][ML] Fix model-specific optimiz...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/20124

[WIP][SPARK-22126][ML] Fix model-specific optimization support for ML tuning.

## What changes were proposed in this pull request?

Support model-specific optimizations for CrossValidator and TrainValidationSplit by grouping `ParamMap`s so that param groups can fit models in parallel, but still allow `Estimator`s to optimally fit a sequence of models themselves. This PR adds a new API to `Estimator` that can be overridden to indicate optimized params, and additional functions in `ParamGridBuilder` to group `ParamMap` arrays that can then be used by the meta-algorithms.

## How was this patch tested?

WIP, need to add tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark wip-model-specific-tuning-SPARK-22126

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20124.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20124

commit c4ff7ab016f440a6f1684f79fdfe677507fca279
Author: Bryan Cutler
Date: 2017-12-01T00:24:55Z

    added model specific optimization to parallel TVS

commit 47a40399250af2f777e53475b5dee812bf244788
Author: Bryan Cutler
Date: 2017-12-01T17:52:10Z

    remove unused import

commit 4d113386a2ae20bae0c4e54860386103c82ae627
Author: Bryan Cutler
Date: 2017-12-14T19:01:06Z

    moved splitting of param maps to ParamGridBuilder

commit 6599cbac79375686b78792ff7c50c85749e4a6cf
Author: Bryan Cutler
Date: 2017-12-15T00:58:41Z

    got param map split working

commit 47781a15cd4d2307a6268d86cd693394e227d842
Author: Bryan Cutler
Date: 2017-12-15T07:03:54Z

    added pipeline getOptimizedParams

commit 0a887bc656e9485d247add1f6de34c299da4c19d
Author: Bryan Cutler
Date: 2017-12-15T07:47:56Z

    moved param grouping to ParamGridBuilder.groupByParam

commit f7256e649fb6aa1e63baca5159e919fbde30dd24
Author: Bryan Cutler
Date: 2017-12-18T05:56:57Z

    remove unused import

commit 7a53f57403ef17753e13cb099ac4866edabc5778
Author: Bryan Cutler
Date: 2017-12-31T07:18:46Z

    fix CrossValidator to use grouped params

commit 994accd402d87639ed70d3cd594f883633a0d849
Author: Bryan Cutler
Date: 2017-12-31T07:44:34Z

    fixed style checks and added docs

commit 53521cac9d39bf9682d67d94d46adde357db1b43
Author: Bryan Cutler
Date: 2017-12-31T07:46:27Z

    added doc
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85552/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85552 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85552/testReport)** for PR 20114 at commit [`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/20072#discussion_r159135028

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -261,6 +261,17 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
+    "org.apache.spark.sql.execution.datasources.sizeFactor")
--- End diff --

Is this config for all data sources or only hadoopFS-related data sources?
[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/20072#discussion_r159134987

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -261,6 +261,17 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val HADOOPFSRELATION_SIZE_FACTOR = buildConf(
--- End diff --

How about `DISK_TO_MEMORY_SIZE_FACTOR`? IMHO the current name doesn't describe the purpose clearly.
[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/20072#discussion_r159135036

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
@@ -60,6 +60,8 @@ case class HadoopFsRelation(
     }
   }

+  private val hadoopFSSizeFactor = sqlContext.conf.hadoopFSSizeFactor
--- End diff --

Shall we move it into the method `sizeInBytes`, since it's only used there?
[GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/20072#discussion_r159135272

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala ---
@@ -82,7 +84,15 @@ case class HadoopFsRelation(
     }
   }

-  override def sizeInBytes: Long = location.sizeInBytes
+  override def sizeInBytes: Long = {
+    val size = location.sizeInBytes * hadoopFSSizeFactor
+    if (size > Long.MaxValue) {
--- End diff --

I think this branch can be removed? Converting a double value larger than `Long.MaxValue` to a long already returns `Long.MaxValue`.
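The conversion behaviour mentioned in this comment can be checked with a small sketch. It assumes the code runs on the JVM, where a `Double`-to-`Long` conversion saturates at the bounds of `Long`; `SaturatingCast.scaled` is a hypothetical helper for illustration, not code from the patch.

```scala
object SaturatingCast {
  // Scale a byte count by a factor. Long * Double yields a Double, and on the
  // JVM .toLong clamps out-of-range values to Long.MaxValue / Long.MinValue
  // instead of wrapping around.
  def scaled(sizeInBytes: Long, factor: Double): Long =
    (sizeInBytes * factor).toLong

  def main(args: Array[String]): Unit = {
    println(scaled(100L, 1.5))                            // 150
    println(scaled(Long.MaxValue, 2.0) == Long.MaxValue)  // true
  }
}
```

Under that assumption, an explicit `if (size > Long.MaxValue)` guard before the conversion is redundant.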