[GitHub] spark issue #16636: [SPARK-19279] [SQL] Block Creating a Hive Table With an ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16636 **[Test build #71795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71795/testReport)** for PR 16636 at commit [`f99e078`](https://github.com/apache/spark/commit/f99e078dd677798c8d9674ea5e08e9a95b43c065). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16228 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71794/
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16228 Merged build finished. Test PASSed.
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16228 **[Test build #71794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71794/testReport)** for PR 16228 at commit [`7e52a83`](https://github.com/apache/spark/commit/7e52a837a984716ea8a0747c73f44b81bf592ff6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16671 So far, the best workaround is the predicate-based JDBC API; otherwise, as I mentioned above, we need to use sampling to find the boundary of each block. > In one embodiment, a user may specify a block size, via an interface. Blocks may be generated at the time of table partitioning. For example, according to a sampling technique described below, a user may select a particular block size and then the utility can determine the average number of table rows per block based on the number of storage bytes per row. Block-by boundary values for that range of rows of that block are determined based on the selected amount of rows, and provided in a query statement generated to obtain the statistical value for the block. That is, select rows from each table may be sampled or range-based. The select rows (or columns) are aggregated to form one "block" from the database table. The "block" may include the whole table, but is typically select rows of the whole table. A few years ago, I did implement sampling-based table logical partitioning. See the link: https://www.google.com/patents/US20160275150. It works pretty well.
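For readers unfamiliar with the predicate-based JDBC API mentioned above, the idea is to hand Spark one WHERE-clause fragment per partition. A minimal sketch of turning boundary values into such predicates follows; the column and boundary values are hypothetical, and this is not the implementation discussed in the PR.

```scala
// Sketch: build non-overlapping range predicates from sorted boundary values,
// one predicate per JDBC partition. Assumes at least two boundaries.
object JdbcPredicates {
  def rangePredicates(col: String, bounds: Seq[Long]): Seq[String] = {
    require(bounds.size >= 2, "need at least two boundary values")
    // Interior ranges: [b(i), b(i+1))
    val inner = bounds.sliding(2).map {
      case Seq(lo, hi) => s"$col >= $lo AND $col < $hi"
    }.toSeq
    // Open-ended ranges on both ends cover the full value space.
    (s"$col < ${bounds.head}" +: inner) :+ s"$col >= ${bounds.last}"
  }
}
```

The resulting strings would then be passed to the `predicates: Array[String]` overload of `DataFrameReader.jdbc`, e.g. `spark.read.jdbc(url, table, JdbcPredicates.rangePredicates("id", bounds).toArray, props)`, so each partition issues its own bounded query.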
[GitHub] spark issue #16657: [SPARK-19306][Core] Fix inconsistent state in DiskBlockO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16657 **[Test build #71800 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71800/testReport)** for PR 16657 at commit [`b0fe795`](https://github.com/apache/spark/commit/b0fe795157a41925ba38bba02ee10a79518c8e42).
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15314 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71798/
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/15314 re-ping @jkbradley
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15314 **[Test build #71798 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71798/testReport)** for PR 15314 at commit [`1d41615`](https://github.com/apache/spark/commit/1d41615863e7d4a0cc225a9a32cc1b175af22a49). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15314 Merged build finished. Test PASSed.
[GitHub] spark pull request #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16670#discussion_r97214965 --- Diff: R/pkg/inst/tests/testthat/test_Windows.R --- @@ -20,7 +20,7 @@ test_that("sparkJars tag in SparkContext", { if (.Platform$OS.type != "windows") { skip("This test is only for Windows, skipped") } - testOutput <- launchScript("ECHO", "a/b/c", capture = TRUE) + testOutput <- launchScript("ECHO", "a/b/c", wait = TRUE) --- End diff -- Hmm, I've tried, I don't think it would work. When calling `system2(.., wait = FALSE, capture = "")` the output to stdout is actually from the child process, so I don't think we would be able to see it from the R process. We could redirect it, but then it would be the same as `system2(..., wait = FALSE, capture = TRUE)` but again it wouldn't be what we are normally calling. I think we would need to dig deeper on this.
[GitHub] spark issue #16579: [SPARK-19218][SQL] Fix SET command to show a result corr...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16579 Hi, @gatorsmile . This is the original PR which has two fixes together now.
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16624 Hi, @gatorsmile . I tested here and applied to #16579 . PR #16579 has two fixes. After merging #16579 , I'm going to close this one.
[GitHub] spark pull request #16609: [SPARK-8480] [CORE] [PYSPARK] [SPARKR] Add setNam...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16609#discussion_r97214415 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -172,11 +172,23 @@ class Dataset[T] private[sql]( this(sqlContext.sparkSession, logicalPlan, encoder) } - /** A friendly name for this Dataset */ + /** +* A friendly name for this Dataset. +* +* @group basic +* @since 2.2.0 +*/ @Since("2.2.0") var name: String = null - /** Assign a name to this Dataset */ + /** +* Assign a name to this Dataset to display in the UI storage tab when cached. +* +* @param name A friendly name for this Dataset --- End diff -- `_name`?
[GitHub] spark pull request #16609: [SPARK-8480] [CORE] [PYSPARK] [SPARKR] Add setNam...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16609#discussion_r97214402 --- Diff: python/pyspark/sql/dataframe.py --- @@ -85,17 +85,20 @@ def rdd(self): self._lazy_rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer())) return self._lazy_rdd -@since(2.1) +@since(2.2) def name(self): """ Return the name of this Dataset. """ return self._jdf.name() @ignore_unicode_prefix -@since(2.1) +@since(2.2) def setName(self, name): -""" +"""Sets the name of this Dataset. + +The name wil be displayed on the storage tab of the UI if the Dataset is cached. --- End diff -- `wil` -> `will`?
[GitHub] spark pull request #16609: [SPARK-8480] [CORE] [PYSPARK] [SPARKR] Add setNam...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16609#discussion_r97214411 --- Diff: python/pyspark/sql/dataframe.py --- @@ -85,17 +85,20 @@ def rdd(self): self._lazy_rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer())) return self._lazy_rdd -@since(2.1) +@since(2.2) def name(self): """ Return the name of this Dataset. """ return self._jdf.name() @ignore_unicode_prefix -@since(2.1) +@since(2.2) def setName(self, name): -""" +"""Sets the name of this Dataset. + +The name wil be displayed on the storage tab of the UI if the Dataset is cached. --- End diff -- I'm not sure but maybe this should say "DataFrame" instead
[GitHub] spark pull request #16657: [SPARK-19306][Core] Fix inconsistent state in Dis...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/16657#discussion_r97214376 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala --- @@ -206,18 +209,22 @@ private[spark] class DiskBlockObjectWriter( streamOpen = false closeResources() } + } catch { + case e: Exception => +logError("Uncaught exception while closing file " + file, e) +} - val truncateStream = new FileOutputStream(file, true) - try { -truncateStream.getChannel.truncate(committedPosition) -file - } finally { -truncateStream.close() - } +var truncateStream: FileOutputStream = null +try { + truncateStream = new FileOutputStream(file, true) + truncateStream.getChannel.truncate(committedPosition) + file } catch { case e: Exception => logError("Uncaught exception while reverting partial writes to file " + file, e) file +} finally { + truncateStream.close() --- End diff -- Sorry about it. I will fix it.
[GitHub] spark pull request #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16670#discussion_r97214373 --- Diff: R/pkg/inst/tests/testthat/test_Windows.R --- @@ -20,7 +20,7 @@ test_that("sparkJars tag in SparkContext", { if (.Platform$OS.type != "windows") { skip("This test is only for Windows, skipped") } - testOutput <- launchScript("ECHO", "a/b/c", capture = TRUE) + testOutput <- launchScript("ECHO", "a/b/c", wait = TRUE) --- End diff -- We could, but unfortunately we don't actually call `launchScript` with wait/capture = TRUE; we call it with wait/capture = FALSE and expect the console/stdout output to leak through, returning NULL. I'll try to add a test for that.
[GitHub] spark pull request #15040: [WIP] [SPARK-17487] [SQL] Configurable bucketing ...
Github user tejasapatil closed the pull request at: https://github.com/apache/spark/pull/15040
[GitHub] spark issue #16654: [SPARK-19303][ML][WIP] Add evaluate method in clustering...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16654 **[Test build #71799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71799/testReport)** for PR 16654 at commit [`5937ce7`](https://github.com/apache/spark/commit/5937ce703df857b109982f49bca96b9c3c325587).
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Using the *predicates* parameter to split the table seems reasonable, but in my opinion it just shifts work that should be done by Spark onto users. Users need to know how to split the table uniformly in the first place, so they may need an extra `count(*)` to explore the distribution of the table.
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15505 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71792/
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15505 Merged build finished. Test PASSed.
[GitHub] spark issue #16642: [SPARK-19284][SQL]append to partitioned datasource table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16642 Merged build finished. Test PASSed.
[GitHub] spark issue #16642: [SPARK-19284][SQL]append to partitioned datasource table...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16642 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71793/
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15505 **[Test build #71792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71792/testReport)** for PR 15505 at commit [`0e2dec5`](https://github.com/apache/spark/commit/0e2dec532780f7e3a5c31582732e10e85e80f1d9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16642: [SPARK-19284][SQL]append to partitioned datasource table...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16642 **[Test build #71793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71793/testReport)** for PR 16642 at commit [`f76f75b`](https://github.com/apache/spark/commit/f76f75b8e8ec804307c2b80ab4a7ceb02dcae716). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SparkListenerExecutorBlacklisted(` * `case class SparkListenerExecutorUnblacklisted(time: Long, executorId: String)` * `case class SparkListenerNodeBlacklisted(` * `case class SparkListenerNodeUnblacklisted(time: Long, hostId: String)` * `case class QualifiedTableName(database: String, name: String)` * ` class MaintenanceTask(periodMs: Long, task: => Unit, onError: => Unit) ` * `class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan] `
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15314 **[Test build #71798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71798/testReport)** for PR 15314 at commit [`1d41615`](https://github.com/apache/spark/commit/1d41615863e7d4a0cc225a9a32cc1b175af22a49).
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/15314 jenkins, retest this please
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Yes, this solution is not suitable for large tables, but I cannot find a better one; this is the best optimisation I can come up with. So just add it as a choice, let users know what they are doing, and require an explicit enable. From my experience, the original equal-step method can cause problems with real data. This conclusion can be drawn from the spark-user mailing list and our real scenario. For example, users will partition the table by `id` because `id` is unique and indexed, but after many inserts and deletes the `id` range becomes very large, and the data ends up skewed by `id`. Very large tables are not so common, and if a large table is sharded, this method may be acceptable. My personal opinion is: > Giving users another choice may be valuable, as long as we do not enable it by default.
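The skew problem described above is exactly what sampling-based boundaries address: instead of cutting the `id` range into equal steps, cut a random sample at equal ranks. The sketch below is purely illustrative (not the PR's implementation) and assumes the partition column fits in a `Long`:

```scala
// Sketch: estimate partition boundaries from a sample of the partition column
// so each range holds roughly the same number of rows, even when the raw
// value range is skewed by inserts and deletes.
object SampleBounds {
  def quantileBounds(sample: Seq[Long], numPartitions: Int): Seq[Long] = {
    require(numPartitions >= 2 && sample.nonEmpty)
    val sorted = sample.sorted
    // numPartitions - 1 interior cut points at equal sample ranks.
    (1 until numPartitions).map { i =>
      sorted((i * sorted.length) / numPartitions)
    }
  }
}
```

With an equal-step split of a skewed `id` column, a few partitions get almost all rows; rank-based cut points keep partition sizes balanced at the cost of one sampling pass over the table.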
[GitHub] spark issue #16654: [SPARK-19303][ML][WIP] Add evaluate method in clustering...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/16654 I think the current clustering metrics are not as general as the classification/regression metrics: WSSSE only applies to `KMeans` and `BiKMeans`, and log-likelihood only applies to `GMM`. I had opened a JIRA about a ClusteringEvaluator, https://issues.apache.org/jira/browse/SPARK-14516, which may add the metrics included in scikit-learn: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.cluster @yanboliang @jkbradley What's your opinion?
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive tab...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r97213829 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -168,6 +168,43 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = DDLUtils.verifyAlterTableAddColumn(catalog, table) + +// If an exception is thrown here we can just assume the table is uncached; +// this can happen with Hive tables when the underlying catalog is in-memory. +val wasCached = Try(sparkSession.catalog.isCached(table.unquotedString)).getOrElse(false) --- End diff -- `AlterTableRenameCommand` has a similar way of doing the uncaching. I thought there might be a reason it exists there, so I did the same. But looking at the code, it seems you are right. Thanks!
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16624 Please update the PR description.
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive tab...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r97213702 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -584,14 +593,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 // Sets the `schema`, `partitionColumnNames` and `bucketSpec` from the old table definition,
 // to retain the spark specific format if it is. Also add old data source properties to table
 // properties, to retain the data source table format.
- val oldDataSourceProps = oldTableDef.properties.filter(_._1.startsWith(DATASOURCE_PREFIX))
--- End diff --
I think the variable name needs to change, since now both Hive tables and data source tables populate the table properties with the schema, and both cases go through this path. I temporarily block ALTER TABLE ADD COLUMNS for data source tables because I am not yet confident the implementation has no holes. But according to @gatorsmile it may be safe to support data source tables too, so I am adding more test cases to confirm. I may remove the condition in this PR.
[GitHub] spark issue #16636: [SPARK-19279] [SQL] Block Creating a Hive Table With an ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16636 **[Test build #71796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71796/testReport)** for PR 16636 at commit [`e3cc423`](https://github.com/apache/spark/commit/e3cc423e2ecf3e9128b8036905c044a3f658cd25).
[GitHub] spark issue #16587: [SPARK-19229] [SQL] Disallow Creating Hive Source Tables...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16587 **[Test build #71797 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71797/testReport)** for PR 16587 at commit [`c6d6a24`](https://github.com/apache/spark/commit/c6d6a2448d51633c22d730c60d219aa16ac81bb1).
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive tab...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r97213578 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ---
@@ -107,7 +107,13 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont
 footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
 MessageType fileSchema = footer.getFileMetaData().getSchema();
 FilterCompat.Filter filter = getFilter(configuration);
- blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+ try {
+   blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+ } catch (IllegalArgumentException e) {
+   // In the case where a particular parquet files does not contain
--- End diff --
Yes, we can. Thanks!
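The diff above wraps row-group filtering in a try/catch so that a file missing a filtered column does not fail the whole read. A minimal sketch of the same fall-back pattern — the filter and block types here are simplified stand-ins, not the real parquet-mr API:

```scala
object RowGroupFilterFallback {
  final case class Block(id: Int, columns: Set[String])

  // Stand-in for filterRowGroups: throws when the filtered column is absent
  // from some row group, mimicking the IllegalArgumentException in parquet-mr.
  def filterRowGroups(column: String, blocks: Seq[Block]): Seq[Block] = {
    if (blocks.exists(!_.columns.contains(column)))
      throw new IllegalArgumentException(s"Column $column not found")
    blocks.filter(_.columns.contains(column))
  }

  // On failure, fall back to reading every row group unfiltered.
  def safeFilter(column: String, blocks: Seq[Block]): Seq[Block] =
    try filterRowGroups(column, blocks)
    catch { case _: IllegalArgumentException => blocks }
}
```

The trade-off is correctness over speed: a file with a mismatched schema is scanned without row-group pruning instead of raising an error to the caller.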
[GitHub] spark pull request #16646: [SPARK-19291][SPARKR][ML] spark.gaussianMixture s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16646
[GitHub] spark issue #16636: [SPARK-19279] [SQL] Block Creating a Hive Table With an ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16636 **[Test build #71795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71795/testReport)** for PR 16636 at commit [`f99e078`](https://github.com/apache/spark/commit/f99e078dd677798c8d9674ea5e08e9a95b43c065).
[GitHub] spark issue #16587: [SPARK-19229] [SQL] Disallow Creating Hive Source Tables...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16587 retest this please
[GitHub] spark issue #16646: [SPARK-19291][SPARKR][ML] spark.gaussianMixture supports...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/16646 Merged into master. If there are new comments about the model persistence compatibility issue, we can address them in follow-up work. Thanks for all your reviews.
[GitHub] spark issue #16579: [SPARK-19218][SQL] Fix SET command to show a result corr...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16579 For `SET -v` without sorting, please refer to #16624, too.
[GitHub] spark pull request #16516: [SPARK-19155][ML] MLlib GeneralizedLinearRegressi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16516
[GitHub] spark issue #16516: [SPARK-19155][ML] MLlib GeneralizedLinearRegression fami...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/16516 Merged into master, branch-2.1 and branch-2.0. Thanks for all your reviews.
[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 @gatorsmile can you take a look?
[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Table 2, with about 5M rows, split into 200 partitions by Spark SQL. (The table uses MySQL sharding, and every partition returns at most 10K rows.)

Old partition result (elements in each partition):
>1,49,54,53,60,59,48,61,52,57,60,69,58,57,50,52,51,66,58,45,59,52,61,56,67,51,45,49,70,49,58,59,61,53,50,53,47,50,46,53,55,53,62,55,48,58,52,62,62,37,65,59,58,55,61,59,46,53,49,49,61,72,60,46,50,51,45,47,55,63,64,63,55,47,65,57,60,60,51,45,48,77,58,57,59,39,50,62,55,57,49,63,51,38,49,66,62,58,53,54,50,54,52,69,51,49,61,60,64,49,52,50,54,58,48,51,50,49,41,68,54,45,65,62,44,52,64,58,47,51,65,47,37,42,39,44,51,65,56,54,69,51,61,63,51,52,47,55,58,66,47,54,53,53,60,66,66,68,64,66,55,58,64,55,50,57,46,56,39,60,57,63,40,51,56,58,44,46,46,44,42,52,52,44,53,46,55,57,68,57,62,48,47,52,59,58,49,44,52,47

(Most of the data is in partition 0, but each partition returns at most 10K rows because of our sharding limit.)

New partition result (elements in each partition):
>2083,1,1,6932,9799,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,8150,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,9,70,2,1,1,1,655,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,40,76,145,38,86,176,369,696,1338,2776,5381

count cost time: 0.8ms
[GitHub] spark issue #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should limit th...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16661 ping @yanboliang
[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16228 **[Test build #71794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71794/testReport)** for PR 16228 at commit [`7e52a83`](https://github.com/apache/spark/commit/7e52a837a984716ea8a0747c73f44b81bf592ff6).
[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Here is the real-data test result: a table with 1.2 million rows, split into 50 partitions by Spark SQL.

Old partition result (elements in each partition):
>100061,100064,100059,100066,100065,100065,100066,100066,100063,100061,100066,100065,70747,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

New partition result (elements in each partition):
>19543,19544,39083,39088,19544,19545,39085,19544,19542,19543,19545,39086,39087,19544,19545,39088,19544,19544,39088,19543,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,19543,19544,39086,19543,19545,39086,39086,19544,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,20701,0

count cost time: 1.27s
[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16671 Can one of the admins verify this patch?
[GitHub] spark pull request #16671: [SparkSQL] a better balance partition method for ...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/16671 [SparkSQL] a better balance partition method for jdbc API

## What changes were proposed in this pull request?
The partition method in `jdbc` uses equal-width steps, which can lead to skew between partitions. The new method introduces balanced partitioning based on the element distribution when splitting, which relieves the skew problem at a small query cost.

## How was this patch tested?
Unit tests and real data.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/djvulee/spark balancePartition
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16671.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16671

commit 88cdf294aa579f65b8272870d762548cf54349ce
Author: DjvuLee
Date: 2017-01-20T09:53:57Z

[SparkSQL] a better balance partition method for jdbc API

The partition method in jdbc, when a partition column is specified, uses equal-width steps, which can lead to skew between partitions. The new method introduces partitioning based on the element distribution when splitting, which keeps the elements balanced between partitions.
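The idea described above can be sketched as follows: instead of splitting the [lower, upper] range into equal-width steps, pick boundaries so that each partition holds roughly the same number of elements. This is an illustrative sketch, not the actual patch; it assumes the sorted column values are available in memory, whereas the PR would derive boundaries with extra queries against the database:

```scala
object BalancedPartitions {
  // Given sorted column values, return n - 1 split points such that each of
  // the n resulting partitions holds roughly values.length / n elements.
  def splitPoints(values: Seq[Long], n: Int): Seq[Long] = {
    require(values.nonEmpty && n > 0)
    val step = values.length.toDouble / n
    // Take the value at each quantile position as a partition boundary.
    (1 until n).map(i => values((i * step).toInt.min(values.length - 1)))
  }
}
```

With a heavily skewed column (say, 97 rows with value 1 and three outliers at 10, 20, 30), equal-width steps over [1, 30] put almost everything in the first partition, while the quantile cuts above keep per-partition counts even.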
[GitHub] spark pull request #16594: [SPARK-17078] [SQL] Show stats when explain
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16594#discussion_r97212822 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -649,6 +649,14 @@ object SQLConf {
 .doubleConf
 .createWithDefault(0.05)
+ val SHOW_STATS_IN_EXPLAIN =
--- End diff --
Then, when the stats are not accurate, will they be the cause of an inefficient plan? If so, why not show users the numbers?
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16611 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71788/ Test PASSed.
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16611 Merged build finished. Test PASSed.
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16611 **[Test build #71788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71788/testReport)** for PR 16611 at commit [`28abf86`](https://github.com/apache/spark/commit/28abf86f5543996c55910f8c097dc6ede10a7d86). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is ...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16670#discussion_r97212748 --- Diff: R/pkg/inst/tests/testthat/test_Windows.R ---
@@ -20,7 +20,7 @@ test_that("sparkJars tag in SparkContext", {
 if (.Platform$OS.type != "windows") {
 skip("This test is only for Windows, skipped")
 }
- testOutput <- launchScript("ECHO", "a/b/c", capture = TRUE)
+ testOutput <- launchScript("ECHO", "a/b/c", wait = TRUE)
--- End diff --
Can we add a similar test with something getting printed on `stdout` from the JVM?
[GitHub] spark issue #16669: [SPARK-16101][SQL] Refactoring CSV read path to be consi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16669 Merged build finished. Test PASSed.
[GitHub] spark issue #16669: [SPARK-16101][SQL] Refactoring CSV read path to be consi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16669 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71789/ Test PASSed.
[GitHub] spark issue #16669: [SPARK-16101][SQL] Refactoring CSV read path to be consi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16669 **[Test build #71789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71789/testReport)** for PR 16669 at commit [`b2938ae`](https://github.com/apache/spark/commit/b2938ae080ee7c36ef751b0bca57c2bfbdf99b43). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16624 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71791/ Test PASSed.
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16624 Merged build finished. Test PASSed.
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16624 **[Test build #71791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71791/testReport)** for PR 16624 at commit [`075f466`](https://github.com/apache/spark/commit/075f4667020438a650659197ac8212c785775e75). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive tab...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r97212553 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -736,6 +736,22 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
 }

 /**
  * Create a [[AlterTableAddColumnsCommand]] command.
  *
  * For example:
  * {{{
  *   ALTER TABLE table1
  *   ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
  * }}}
  */
 override def visitAddTableColumns(ctx: AddTableColumnsContext): LogicalPlan = withOrigin(ctx) {
   AlterTableAddColumnsCommand(
     visitTableIdentifier(ctx.tableIdentifier),
     Option(ctx.columns).map(visitColTypeList).getOrElse(Nil)
--- End diff --
`columns` is not optional for this case.
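The `Option(ctx.columns).map(visitColTypeList).getOrElse(Nil)` idiom under discussion wraps a possibly-null ANTLR context: `Option(x)` yields `None` when `x` is `null`, so the visit only runs if the parser actually matched the clause. A minimal sketch of the pattern — the context class here is a hypothetical stand-in for the generated parser type:

```scala
object NullSafeVisit {
  // Hypothetical stand-in for an ANTLR context; the reference may be null
  // when the optional grammar clause was not matched.
  final case class ColTypeListContext(raw: Seq[String])

  def visitColTypeList(ctx: ColTypeListContext): Seq[String] =
    ctx.raw.map(_.toLowerCase)

  // Option(null) == None, so a missing clause maps to Nil instead of an NPE.
  def columnsOf(ctx: ColTypeListContext): Seq[String] =
    Option(ctx).map(visitColTypeList).getOrElse(Nil)
}
```

The reviewer's point is that when the grammar makes the clause mandatory, the context can never be null, so the `Option` wrapper is dead defensiveness.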
[GitHub] spark pull request #16659: [SPARK-19309][SQL] disable common subexpression e...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16659#discussion_r97212331 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala ---
@@ -67,28 +67,33 @@ class EquivalentExpressions {
 /**
  * Adds the expression to this data structure recursively. Stops if a matching expression
  * is found. That is, if `expr` has already been added, its children are not added.
- * If ignoreLeaf is true, leaf nodes are ignored.
  */
- def addExprTree(
-     root: Expression,
-     ignoreLeaf: Boolean = true,
-     skipReferenceToExpressions: Boolean = true): Unit = {
-   val skip = (root.isInstanceOf[LeafExpression] && ignoreLeaf) ||
+ def addExprTree(expr: Expression): Unit = {
+   val skip = expr.isInstanceOf[LeafExpression] ||
     // `LambdaVariable` is usually used as a loop variable, which can't be evaluated ahead of the
     // loop. So we can't evaluate sub-expressions containing `LambdaVariable` at the beginning.
-     root.find(_.isInstanceOf[LambdaVariable]).isDefined
+     expr.find(_.isInstanceOf[LambdaVariable]).isDefined
+
-   // There are some special expressions that we should not recurse into children.
+   // There are some special expressions that we should not recurse into all of its children.
    // 1. CodegenFallback: it's children will not be used to generate code (call eval() instead)
-   // 2. ReferenceToExpressions: it's kind of an explicit sub-expression elimination.
-   val shouldRecurse = root match {
-     // TODO: some expressions implements `CodegenFallback` but can still do codegen,
-     // e.g. `CaseWhen`, we should support them.
-     case _: CodegenFallback => false
-     case _: ReferenceToExpressions if skipReferenceToExpressions => false
-     case _ => true
+   // 2. If: common subexpressions will always be evaluated at the beginning, but the true and
--- End diff --
I just found that not all the children of `AtLeastNNonNulls` get accessed during evaluation either. Do we need to add it here too?
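The point of the `If` case in the diff above is that a subexpression appearing only inside one branch must not be hoisted and evaluated up front, because the other branch may be the one taken, or the eager evaluation may throw. A toy Scala illustration of the hazard (not the Catalyst implementation, just the underlying semantics):

```scala
object BranchHoisting {
  // A "subexpression" that throws for bad input, like a division.
  def divide(a: Int, b: Int): Int = a / b

  // Correct: the branch expression is evaluated only when its branch is taken.
  def lazyBranch(b: Int): Int = if (b != 0) divide(10, b) else 0

  // Hoisted: evaluating the subexpression before the condition, as a naive
  // common-subexpression elimination would do, throws ArithmeticException
  // for b == 0 and changes program behavior.
  def hoistedBranch(b: Int): Int = {
    val sub = divide(10, b)
    if (b != 0) sub else 0
  }
}
```

The same reasoning applies to short-circuiting expressions such as `AtLeastNNonNulls` mentioned in the comment: children past the short-circuit point may never be evaluated, so hoisting them is unsound.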
[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16593
[GitHub] spark issue #16642: [SPARK-19284][SQL]append to partitioned datasource table...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16642 **[Test build #71793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71793/testReport)** for PR 16642 at commit [`f76f75b`](https://github.com/apache/spark/commit/f76f75b8e8ec804307c2b80ab4a7ceb02dcae716).
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16593 LGTM, merging to master!
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15505 **[Test build #71792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71792/testReport)** for PR 15505 at commit [`0e2dec5`](https://github.com/apache/spark/commit/0e2dec532780f7e3a5c31582732e10e85e80f1d9).
[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16552 Merged build finished. Test PASSed.
[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16552 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71785/
[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16552 **[Test build #71785 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71785/testReport)** for PR 16552 at commit [`cb7a1be`](https://github.com/apache/spark/commit/cb7a1bed92a111f03dd1d7464c494be5b8fed502).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SparkListenerExecutorBlacklisted(`
  * `case class SparkListenerExecutorUnblacklisted(time: Long, executorId: String)`
  * `case class SparkListenerNodeBlacklisted(`
  * `case class SparkListenerNodeUnblacklisted(time: Long, hostId: String)`
  * `class MaintenanceTask(periodMs: Long, task: => Unit, onError: => Unit)`
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/15505 @squito My understanding is that the `TaskSchedulerImpl` class contains many `synchronized` statements (synchronized methods). If one synchronized block's execution takes a very long time, it blocks all the other synchronized blocks, which reduces the throughput of the `TaskSchedulerImpl` instance.
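The contention pattern described above can be sketched in a few lines. This is an illustrative Python analogue (not Spark's actual `TaskSchedulerImpl`; all names are hypothetical): when every method of a scheduler-like object shares one lock, the way JVM `synchronized` methods share the instance monitor, a slow operation serializes fast, unrelated ones behind it:

```python
import threading
import time

class CoarseLockedScheduler:
    """Toy scheduler where one lock guards every method,
    mimicking per-instance `synchronized` methods on the JVM."""
    def __init__(self):
        self._lock = threading.Lock()  # single shared lock
        self.events = []

    def slow_update(self):
        with self._lock:               # holds the shared lock for a long time
            time.sleep(0.2)
            self.events.append("slow")

    def fast_lookup(self):
        with self._lock:               # must wait for slow_update to release
            self.events.append("fast")

sched = CoarseLockedScheduler()
t = threading.Thread(target=sched.slow_update)
t.start()
time.sleep(0.05)                       # let slow_update grab the lock first
sched.fast_lookup()                    # blocks until slow_update completes
t.join()
```

Because the fast call cannot overtake the slow one, `events` always ends up `["slow", "fast"]`. Moving expensive work (such as task serialization, the subject of this PR) out of the locked region shortens the critical section and relieves this bottleneck.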
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/15505#discussion_r97211797

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -602,6 +619,20 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
     Future.successful(false)
   }
-private[spark] object CoarseGrainedSchedulerBackend {
+private[spark] object CoarseGrainedSchedulerBackend extends Logging {
   val ENDPOINT_NAME = "CoarseGrainedScheduler"
+  // abort TaskSetManager without exception
+  def abortTaskSetManager(
+      scheduler: TaskSchedulerImpl,
+      taskId: Long,
+      msg: => String,
+      exception: Option[Throwable] = None): Unit = {
+    scheduler.taskIdToTaskSetManager.get(taskId).foreach { taskSetMgr =>
+      try {
+        taskSetMgr.abort(msg, exception)
--- End diff --

`taskSetMgr.abort` is thread-safe; it looks fine from the calling code.
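The shape of the "abort without exception" helper in the diff above can be sketched outside Spark. This is a hypothetical Python stand-in (the dictionary, fake manager, and function names are illustrative, not Spark's API): look up the manager for a task id and abort it, swallowing any error so the caller's scheduling loop is never brought down:

```python
import logging

def abort_task_set_manager(task_id_to_mgr, task_id, msg, exc=None):
    """Abort the manager owning task_id, if any; never raise."""
    mgr = task_id_to_mgr.get(task_id)
    if mgr is None:
        return False                    # unknown task id: nothing to abort
    try:
        mgr.abort(msg, exc)             # assumed thread-safe, per the review
        return True
    except Exception:
        logging.exception("abort failed for task %s", task_id)
        return False

class FakeMgr:
    """Minimal stand-in recording the abort it received."""
    def __init__(self):
        self.aborted = None
    def abort(self, msg, exc):
        self.aborted = (msg, exc)

mgrs = {7: FakeMgr()}
ok = abort_task_set_manager(mgrs, 7, "serialization failed")
missing = abort_task_set_manager(mgrs, 99, "no such task")
```

The `foreach`-over-`Option` in the Scala diff corresponds to the `None` check here: a missing task id is silently a no-op rather than an error.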
[GitHub] spark issue #16245: [SPARK-18824][SQL] Add optimizer rule to reorder Filter ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16245 It is true, of course, that you can construct a combination of complex string operations and compare it with a simple Scala UDF. But as you said, the previous claim holds most of the time. I also think a Scala UDF is usually used to write complex logic that can't be achieved with built-in expressions.
[GitHub] spark pull request #16245: [SPARK-18824][SQL] Add optimizer rule to reorder ...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/16245#discussion_r97211716

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -514,6 +514,34 @@ case class OptimizeCodegen(conf: CatalystConf) extends Rule[LogicalPlan] {
 /**
+ * Reorders the predicates in `Filter` so more expensive expressions like UDF can evaluate later.
+ */
+object ReorderPredicatesInFilter extends Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case f @ Filter(pred, child) =>
+      // Extracts deterministic suffix expressions from Filter predicate.
+      val expressions = splitConjunctivePredicates(pred)
+      // The beginning index of the deterministic suffix expressions.
+      var splitIndex = -1
+      (expressions.length - 1 to 0 by -1).foreach { idx =>
+        if (splitIndex == -1 && !expressions(idx).deterministic) {
+          splitIndex = idx + 1
+        }
+      }
+      if (splitIndex == expressions.length) {
+        // All expressions are non-deterministic, no reordering.
+        f
+      } else {
+        val (nonDeterminstics, deterministicExprs) = expressions.splitAt(splitIndex)
--- End diff --

Hmm, actually that's what I meant; perhaps some confusion between `non-deterministic` and `non-foldable`? I think we can skip both of them in short-circuit evaluation. Since those expressions are not `stateful` (unfortunately, Spark SQL expressions don't have the concept of `stateful`), skipping their evaluation is harmless, and that is exactly the short-circuit logic of the `AND` expression.
[GitHub] spark issue #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is getting...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16670 Merged build finished. Test PASSed.
[GitHub] spark issue #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is getting...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16670 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71790/
[GitHub] spark issue #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is getting...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16670 **[Test build #71790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71790/testReport)** for PR 16670 at commit [`294ce99`](https://github.com/apache/spark/commit/294ce991d2e1c8d7a38b526ccf4f35a7ac41fbc1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16245: [SPARK-18824][SQL] Add optimizer rule to reorder Filter ...
Github user chenghao-intel commented on the issue: https://github.com/apache/spark/pull/16245 I think `Scala UDF needs extra conversion between internal format and external format on input and output` is true most of the time, but not all of the time. For example, some built-in string-based operations and their combinations are also quite heavy to evaluate, and most likely this would concern an experienced SQL developer who writes optimal (business-related short-circuiting) SQL expressions.
[GitHub] spark pull request #16245: [SPARK-18824][SQL] Add optimizer rule to reorder ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16245#discussion_r97211599

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (same `ReorderPredicatesInFilter` hunk as quoted above) ---

Yes. However, if the first expression in the `AND` is non-deterministic, skipping it might change its next evaluation, so we can only reorder the deterministic expressions that come after the non-deterministic ones.
[GitHub] spark issue #15219: [SPARK-14098][SQL] Generate Java code to build CachedCol...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15219 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71787/
[GitHub] spark issue #15219: [SPARK-14098][SQL] Generate Java code to build CachedCol...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15219 **[Test build #71787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71787/testReport)** for PR 15219 at commit [`b15d9d5`](https://github.com/apache/spark/commit/b15d9d5724936f5946d99acc40b75754e8583aa6).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15219: [SPARK-14098][SQL] Generate Java code to build CachedCol...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15219 Merged build finished. Test FAILed.
[GitHub] spark pull request #16245: [SPARK-18824][SQL] Add optimizer rule to reorder ...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/16245#discussion_r97211489

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (same `ReorderPredicatesInFilter` hunk as quoted above) ---

I mean `(rand() > 0) && b` should be equal to `b && (rand() > 0)`, and the latter probably even performs better, thanks to the short-circuit evaluation of `AND`, doesn't it?
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71784/
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Merged build finished. Test PASSed.
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71784/testReport)** for PR 16593 at commit [`7bdc265`](https://github.com/apache/spark/commit/7bdc265500cbfd6b4dc16ec6a6ce7c321e7dd3dc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16245: [SPARK-18824][SQL] Add optimizer rule to reorder Filter ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16245 I think most of the time it should be, as a Scala UDF needs extra conversion between the internal and external formats on input and output.
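The conversion cost mentioned above can be made concrete with a toy model. This Python sketch is purely illustrative (it is not Spark code; the byte string stands in for Spark's internal `UTF8String`-style representation): a built-in operator works on the internal format directly, while a UDF call forces a round trip through the external type the user function expects:

```python
# Toy model of built-in expression vs. UDF evaluation cost.

def builtin_upper(internal_bytes):
    # A "built-in" can operate on the engine's internal format directly.
    return internal_bytes.upper()

def run_udf(udf, internal_bytes):
    external = internal_bytes.decode("utf-8")  # internal -> external conversion
    result = udf(external)                      # user code on the external type
    return result.encode("utf-8")               # external -> internal conversion

row = "spark".encode("utf-8")
via_builtin = builtin_upper(row)
via_udf = run_udf(lambda s: s.upper(), row)
```

Both paths compute the same value, but the UDF path pays two extra conversions per row, which is why the optimizer rule in this PR prefers to evaluate UDF predicates last.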
[GitHub] spark issue #16245: [SPARK-18824][SQL] Add optimizer rule to reorder Filter ...
Github user chenghao-intel commented on the issue: https://github.com/apache/spark/pull/16245 Actually I doubt this is really an optimization, as the assumption that a Scala UDF is slower than non-Scala-UDF expressions is probably not always true.
[GitHub] spark pull request #16245: [SPARK-18824][SQL] Add optimizer rule to reorder ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16245#discussion_r97211371

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (same `ReorderPredicatesInFilter` hunk as quoted above) ---

Reordering non-deterministic expressions might change the evaluation results. I think `foldable` expressions are already handled by another rule. And I remember we don't have an explicit `stateful` kind of expression; it is classified as `non-deterministic` too.
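The split being debated above is simple to state: keep every predicate up to and including the last non-deterministic one in its original order, and treat only the trailing deterministic suffix as safe to reorder. A language-neutral Python sketch of that `splitIndex` logic (a simplified stand-in, not the Catalyst rule itself; predicates are modeled as `(label, deterministic)` pairs):

```python
def split_reorderable(preds):
    """preds: list of (label, deterministic) pairs.
    Returns (fixed_prefix, reorderable_suffix): the prefix ends at the
    last non-deterministic predicate, mirroring the splitIndex loop in
    the diff. If no predicate is non-deterministic, everything is
    reorderable."""
    split = 0
    for i, (_, deterministic) in enumerate(preds):
        if not deterministic:
            split = i + 1  # suffix can only start after this point
    return preds[:split], preds[split:]

preds = [
    ("a > 0", True),
    ("rand() > 0.5", False),  # non-deterministic: pins everything before it
    ("udf(b)", True),
    ("c = 1", True),
]
prefix, suffix = split_reorderable(preds)
```

Only `suffix` (here the expensive `udf(b)` and the cheap `c = 1`) may then be sorted cheap-first; moving `udf(b)` ahead of `rand() > 0.5` could change how often the random generator is consulted, which is exactly viirya's point.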
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][WIP] R should check for java when ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16596 I've found the root cause from my investigation, but I need to test the fix cross-platform.
[GitHub] spark pull request #16245: [SPARK-18824][SQL] Add optimizer rule to reorder ...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/16245#discussion_r97211330

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (same `ReorderPredicatesInFilter` hunk as quoted above) ---

I am a little confused about why we need to separate out the `non-deterministic` expressions. Should it be `stateful` or `foldable` instead?
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16624 The final failure, `HiveSparkSubmitSuite.dir`, is unrelated to this PR.
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16624 **[Test build #71791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71791/testReport)** for PR 16624 at commit [`075f466`](https://github.com/apache/spark/commit/075f4667020438a650659197ac8212c785775e75).
[GitHub] spark issue #16579: [SPARK-19218][SQL] Fix SET command to show a result corr...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16579 Hi, @srowen and @gatorsmile. Finally, this PR resolved all issues. Could you review this again?
[GitHub] spark pull request #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16670#discussion_r97211236

--- Diff: R/pkg/R/utils.R ---
@@ -756,12 +756,12 @@ varargsToJProperties <- function(...) {
   props
 }
-launchScript <- function(script, combinedArgs, capture = FALSE) {
+launchScript <- function(script, combinedArgs, wait = FALSE) {
   if (.Platform$OS.type == "windows") {
     scriptWithArgs <- paste(script, combinedArgs, sep = " ")
-    shell(scriptWithArgs, translate = TRUE, wait = capture, intern = capture) # nolint
+    shell(scriptWithArgs, translate = TRUE, wait = wait, intern = wait) # nolint
   } else {
-    system2(script, combinedArgs, wait = capture, stdout = capture)
+    system2(script, combinedArgs, wait = wait)
--- End diff --

http://www.astrostatistics.psu.edu/datasets/R/html/base/html/shell.html On Windows, `intern = FALSE` seems to mean the output goes to the console. (The doc page is missing on stat.ethz.ch.)
[GitHub] spark issue #16624: [WIP] Fix `SET -v` not to raise exceptions for configs w...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16624 Retest this please
[GitHub] spark issue #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is getting...
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16670

**[Test build #71790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71790/testReport)** for PR 16670 at commit [`294ce99`](https://github.com/apache/spark/commit/294ce991d2e1c8d7a38b526ccf4f35a7ac41fbc1).
[GitHub] spark pull request #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is ...
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16670#discussion_r97211194

--- Diff: R/pkg/R/utils.R ---

```diff
@@ -756,12 +756,12 @@ varargsToJProperties <- function(...) {
   props
 }

-launchScript <- function(script, combinedArgs, capture = FALSE) {
+launchScript <- function(script, combinedArgs, wait = FALSE) {
   if (.Platform$OS.type == "windows") {
     scriptWithArgs <- paste(script, combinedArgs, sep = " ")
-    shell(scriptWithArgs, translate = TRUE, wait = capture, intern = capture) # nolint
+    shell(scriptWithArgs, translate = TRUE, wait = wait, intern = wait) # nolint
   } else {
-    system2(script, combinedArgs, wait = capture, stdout = capture)
+    system2(script, combinedArgs, wait = wait)
```

--- End diff --

http://stat.ethz.ch/R-manual/R-devel/library/base/html/system2.html

`stdout = FALSE` means "discard output"; `stdout = ""` (the default) means output goes to the console.
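For reference, a minimal sketch (not part of the PR) of how `system2()` treats its `stdout` argument, assuming an `echo` binary is on the PATH:

```r
# stdout = "" (the default): the child's output goes to the R console;
# the return value is the command's exit status.
status <- system2("echo", "hello")

# stdout = FALSE: the child's output is discarded entirely.
system2("echo", "hello", stdout = FALSE)

# stdout = TRUE: the output is captured and returned as a character vector.
captured <- system2("echo", "hello", stdout = TRUE)
```

Dropping `stdout = capture` in the non-Windows branch therefore falls back to the default `stdout = ""`, which is exactly what keeps JVM output visible on the console.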
[GitHub] spark pull request #16670: [SPARK-19324][SPARKR] Spark VJM stdout output is ...
GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/16670

[SPARK-19324][SPARKR] Spark VJM stdout output is getting dropped in SparkR

## What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode when results are expected on stdout (which should be somewhat rare, but possible).

Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
NULL
```

After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side.

## How was this patch tested?

manual

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rjvmstdout

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16670.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16670

commit 294ce991d2e1c8d7a38b526ccf4f35a7ac41fbc1
Author: Felix Cheung
Date: 2017-01-22T02:14:06Z

    do not drop stdout