[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84371/ Test PASSed.
[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19758#discussion_r154284858 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeSplitUtilsSuite.scala --- @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.tree.{CategoricalSplit, ContinuousSplit, Split} +import org.apache.spark.ml.util.DefaultReadWriteTest +import org.apache.spark.mllib.tree.impurity.{Entropy, Impurity} +import org.apache.spark.mllib.tree.model.ImpurityStats +import org.apache.spark.mllib.util.MLlibTestSparkContext + +/** Suite exercising helper methods for making split decisions during decision tree training. */ +class TreeSplitUtilsSuite + extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + /** + * Get a DTStatsAggregator for sufficient stat collection/impurity calculation populated + * with the data from the specified training points. Assumes a feature index of 0 and that + * all training points have the same weights (1.0). + */ + private def getAggregator( + metadata: DecisionTreeMetadata, + values: Array[Int], + labels: Array[Double], + featureSplits: Array[Split]): DTStatsAggregator = { +// Create stats aggregator +val statsAggregator = new DTStatsAggregator(metadata, featureSubset = None) +// Update parent impurity stats +val featureIndex = 0 +val instanceWeights = Array.fill[Double](values.length)(1.0) +AggUpdateUtils.updateParentImpurity(statsAggregator, indices = values.indices.toArray, + from = 0, to = values.length, instanceWeights, labels) +// Update current aggregator's impurity stats +values.zip(labels).foreach { case (value: Int, label: Double) => + if (metadata.isUnordered(featureIndex)) { +AggUpdateUtils.updateUnorderedFeature(statsAggregator, value, label, + featureIndex = featureIndex, featureIndexIdx = 0, featureSplits, instanceWeight = 1.0) + } else { +AggUpdateUtils.updateOrderedFeature(statsAggregator, value, label, featureIndexIdx = 0, + instanceWeight = 1.0) + } +} +statsAggregator + } + + /** + * Check that left/right impurities match what we'd expect for a split. 
+ * @param labels Labels whose impurity information should be reflected in stats + * @param stats ImpurityStats object containing impurity info for the left/right sides of a split + */ + private def validateImpurityStats( + impurity: Impurity, + labels: Array[Double], + stats: ImpurityStats, + expectedLeftStats: Array[Double], + expectedRightStats: Array[Double]): Unit = { +// Compute impurity for our data points manually +val numClasses = (labels.max + 1).toInt +val fullImpurityStatsArray + = Array.tabulate[Double](numClasses)((label: Int) => labels.count(_ == label).toDouble) +val fullImpurity = Entropy.calculate(fullImpurityStatsArray, labels.length) +// Verify that impurity stats were computed correctly for split +assert(stats.impurityCalculator.stats === fullImpurityStatsArray) +assert(stats.impurity === fullImpurity) +assert(stats.leftImpurityCalculator.stats === expectedLeftStats) +assert(stats.rightImpurityCalculator.stats === expectedRightStats) +assert(stats.valid) + } + + /* * * * * * * * * * * Choosing Splits * * * * * * * * * * */ + + test("chooseSplit: choose correct type of split (continuous split)") { +// Construct (binned) continuous data +val labels = Array(0.0, 0.0, 1.0) +val values = Array(1, 2, 3) +val featureIndex = 0 +// Get an array of continuous splits corresponding to values in our binned data +val splits =
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84371/testReport)** for PR 19813 at commit [`aa3db2e`](https://github.com/apache/spark/commit/aa3db2edca66ab04ecb8fbd54750cbd46544eb1d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19840 I'm a little concerned about such a change: it could be misconfigured and introduce a discrepancy between the driver's Python and the executors' Python. At the very least we should honor the "spark.executorEnv.PYSPARK_PYTHON" configuration unless the `PYSPARK_PYTHON` executable shipped by the driver does not exist. (Just my two cents.) Ping @zjffdu, any thoughts?
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 Yes, you are right: we should use the same Python executables. But **same** should mean the same binary, not just the same path.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84373/ Test PASSed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Merged build finished. Test PASSed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84373/testReport)** for PR 19857 at commit [`c6f2250`](https://github.com/apache/spark/commit/c6f225025a1ba002b6aa4ce83fb67dbe742395b1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764 @gczsjdy Added a simple example.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19840 Oh, I see. You're running in client mode, so `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python` is useless. So I guess the behavior is expected: the driver will honor the `PYSPARK_PYTHON` env and ship it to the executors, so the cluster will use the same Python executables. With your test above, `/path/to/python` is different for the driver and the executors; will that bring in issues? The driver uses `PYSPARK_PYTHON` and the executors use `spark.executorEnv.PYSPARK_PYTHON`, which point to different paths. Normally I think we don't need to set PYSPARK_PYTHON on the executor side. Please correct me if I'm wrong.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84373/testReport)** for PR 19857 at commit [`c6f2250`](https://github.com/apache/spark/commit/c6f225025a1ba002b6aa4ce83fb67dbe742395b1).
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test FAILed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84370/ Test FAILed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84370/testReport)** for PR 19813 at commit [`98850a6`](https://github.com/apache/spark/commit/98850a67c99b02b9ba98ec544f7d0d5671142716). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19824: [SPARK][STREAMING] Invoke onBatchCompletion() only when ...
Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/19824 `What I want to say is that if a Job is failed, we should consider the Batch as not completed.` Isn't #16542 doing the same thing?
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84372/ Test FAILed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19857 Merged build finished. Test FAILed.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84372/testReport)** for PR 19857 at commit [`980c8ec`](https://github.com/apache/spark/commit/980c8ec87ddbc9f938942e78bb4cfe9753722bd2). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19857 **[Test build #84372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84372/testReport)** for PR 19857 at commit [`980c8ec`](https://github.com/apache/spark/commit/980c8ec87ddbc9f938942e78bb4cfe9753722bd2).
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 (screenshot: https://user-images.githubusercontent.com/8326978/33471349-e570953e-d6a7-11e7-9fec-74963efe37d2.png) @jerryshao The envs are set correctly by YARN, but the `pythonExec` is generated in `PythonRDD` on the driver side and delivered to the executor side within a closure.
[GitHub] spark issue #19857: [SPARK-22667][ML] Fix model-specific optimization suppor...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19857 @MrBago @jkbradley I think this PR needs to be reviewed and merged first, before reviewing #19627, because this PR changes some critical code paths.
[GitHub] spark pull request #19857: [SPARK-22667][ML] Fix model-specific optimization...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19857 [SPARK-22667][ML] Fix model-specific optimization support for ML tuning: Python API

## What changes were proposed in this pull request?

Python CrossValidator/TrainValidationSplit:
- With the base Estimator implemented in Scala/Java → convert the base Estimator to a Scala/Java object and call the JVM fit() (as in Weichen's comment).
- With the base Estimator implemented in Python → Python needs the same machinery for multi-model fitting and parallelism as Scala, and we can call directly into it.

New API added:
```
class Estimator:
    def parallelFit(self, dataset, paramMaps, threadPool, modelCallback):
```

**Note** This PR also fixes the `# TODO: persist average metrics as well` in CV/TVS, because the test suite needs to check the consistency of `avgMetrics`. If this needs backporting to an older Spark version, I can split it out into a separate PR.

## How was this patch tested?

Existing unit tests already cover each code path that needs testing.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark fix_model_spec_optim_py Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19857.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19857

commit 980c8ec87ddbc9f938942e78bb4cfe9753722bd2 Author: WeichenXu Date: 2017-11-30T10:08:55Z init pr
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19717 @vanzin @mridulm @jerryshao would love your comment.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19717 will do a pass on the latest over the weekend
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19717 /cc @felixcheung
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Merged build finished. Test PASSed.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84368/ Test PASSed.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054 **[Test build #84368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84368/testReport)** for PR 19054 at commit [`69e288e`](https://github.com/apache/spark/commit/69e288efda8dbaab667f0e9ed6c7f2cb811f79b1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19840 I think in YARN we have several different ways to set `PYSPARK_PYTHON`; I guess your issue is which one should take priority? Can you please:
1. Define a consistent ordering for such envs (which one should take priority, spark.yarn.appMasterEnv.XXX or XXX), and document it.
2. Check that it works as expected for `spark.yarn.appMasterEnv.PYSPARK_PYTHON` and `spark.executorEnv.PYSPARK_PYTHON`.
I guess other envs will also suffer from this problem, am I right?
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84367/ Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84367/testReport)** for PR 19813 at commit [`0d358d6`](https://github.com/apache/spark/commit/0d358d635494199582aa6e38fdbeec0f6446c029). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #18324: [SPARK-21045][PYSPARK]Fixed executor blocked beca...
Github user dataknocker commented on a diff in the pull request: https://github.com/apache/spark/pull/18324#discussion_r154274837 --- Diff: python/pyspark/worker.py --- @@ -177,8 +180,11 @@ def process(): process() except Exception: try: +exc_info = traceback.format_exc() +if isinstance(exc_info, unicode): +exc_info = exc_info.encode('utf-8') --- End diff -- @HyukjinKwon is this PR just going to be left hanging?
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19856
> I think the log can't reflect the behavior of consumer connection,because consumer.create doesn't do any connect,it only construct a ZookeeperConsumerConnector instance

That's not true: `ZookeeperConsumerConnector` also connects to ZK during initialization, please see here (https://github.com/apache/kafka/blob/c9f930e3fe25e5675e64550c8b396356c5349ca7/core/src/main/scala/kafka/consumer/ZookeeperConsumerConnector.scala#L126).
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19856 Actually there's no issue here; IMHO your understanding of this log is slightly different from its original purpose.
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/19856 @jerryshao I think the log can't reflect the behavior of the consumer connection, because consumer.create doesn't do any connecting; it only constructs a ZookeeperConsumerConnector instance, so the right position is where zkclient connects to ZK. ReliableKafkaReceiver is not marked deprecated, so users will still use it. To avoid being misled by this log when a problem connecting to ZK occurs, I think we should fix it.
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r154272738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -152,3 +153,62 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType)) } } + +/** + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints. + * + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join --- End diff -- It sounds promising.
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19856 I guess the original purpose of this log is to reflect the behavior of the consumer connection. It is not super necessary to make such a trivial change. Also, `ReliableKafkaReceiver` is not recommended any more, so let's keep it as it is.
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/19856 @srowen Please help merge this PR as it has passed all tests.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 I guess specifying `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` overwrites spark.executorEnv.PYSPARK_PYTHON via [context.py#L156](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L156), which should use `setIfMissing` instead.
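For context, the suggestion above is that an explicitly configured executor Python should win over the driver's environment variable. The actual fix would live in PySpark's context.py; the sketch below only illustrates the intended `setIfMissing` semantics using the Scala SparkConf API, with object and method names that are purely illustrative:

```scala
import org.apache.spark.SparkConf

object PythonExecConfExample {
  def configuredConf(): SparkConf = {
    val conf = new SparkConf()
    // Respect an explicitly configured executor Python; only fall back to the
    // driver's PYSPARK_PYTHON environment variable when the key is unset.
    sys.env.get("PYSPARK_PYTHON").foreach { driverPython =>
      conf.setIfMissing("spark.executorEnv.PYSPARK_PYTHON", driverPython)
    }
    conf
  }
}
```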
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154270654 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column value falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column value falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- just move this condition after the length check: ``` if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) && (value == histogram.bins(i + 1).lo)) ```
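To make the suggestion concrete, here is a compilable sketch of `findLastBinForValue` with the reviewer's merged condition. The `Histogram`/`HistogramBin` stubs stand in for the catalyst statistics types (assumed shape: each bin carries its lower and upper bound), and the body of the boundary branch, which is truncated in the quote above, is assumed to simply advance `binId`:

```scala
// Stand-ins for the catalyst statistics types (an assumption for a
// self-contained sketch).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

object HistogramExample {
  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    var binId = 0
    for (i <- 0 until histogram.bins.length) {
      if (value > histogram.bins(i).hi) {
        // increment binId to point to the next bin
        binId += 1
      }
      // The reviewer's suggestion: fold the length guard and both boundary
      // checks into a single condition instead of nesting them.
      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) &&
          (value == histogram.bins(i + 1).lo)) {
        binId += 1
      }
    }
    binId
  }
}
```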
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84365/ Test PASSed.
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19831 Yes, I saw some of these tables in my cluster, but the user did not manually modify this parameter:
```
# Detailed Table Information
Database              dw
Table                 prod
Owner                 bi
Created Time          Tue Nov 03 16:33:52 CST 2015
Last Access           Thu Jan 01 08:00:00 CST 1970
Created By            Spark 2.2 or prior
Type                  EXTERNAL
Provider              hive
Comment               Product list
Table Properties      [transient_lastDdlTime=1508260780, last_modified_time=1473154014, last_modified_by=bi]
Statistics            26596461123 bytes, 0 rows
Location              viewfs://cluster9/user/hive/warehouse/dw.db/prod
Serde Library         org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat           org.apache.hadoop.mapred.TextInputFormat
OutputFormat          org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties    [serialization.format=1]
Partition Provider    Catalog

Time taken: 1.241 seconds, Fetched 70 row(s)
```
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Merged build finished. Test PASSed.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84365/testReport)** for PR 19783 at commit [`6e6c49b`](https://github.com/apache/spark/commit/6e6c49bdfef0071f18b6e9b607b7fc1bf5b49ff8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19845 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84369/ Test PASSed.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19845 **[Test build #84369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84369/testReport)** for PR 19845 at commit [`0c7e537`](https://github.com/apache/spark/commit/0c7e5378a491f01b14df23fe56ae50a0f52971ee). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19845 Merged build finished. Test PASSed.
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19631 LGTM.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84371/testReport)** for PR 19813 at commit [`aa3db2e`](https://github.com/apache/spark/commit/aa3db2edca66ab04ecb8fbd54750cbd46544eb1d).
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19845 **[Test build #84369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84369/testReport)** for PR 19845 at commit [`0c7e537`](https://github.com/apache/spark/commit/0c7e5378a491f01b14df23fe56ae50a0f52971ee).
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84370/testReport)** for PR 19813 at commit [`98850a6`](https://github.com/apache/spark/commit/98850a67c99b02b9ba98ec544f7d0d5671142716).
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154267815 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala --- @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.codegen + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions._ + +/** + * Defines APIs used in expression code generation. + */ +object ExpressionCodegen { + + /** + * Given an expression, returns the all necessary parameters to evaluate it, so the generated + * code of this expression can be split in a function. + * The 1st string in returned tuple is the parameter strings used to call the function. + * The 2nd string in returned tuple is the parameter strings used to declare the function. + * + * Returns `None` if it can't produce valid parameters. + * + * Params to include: + * 1. Evaluated columns referred by this, children or deferred expressions. + * 2. Rows referred by this, children or deferred expressions. + * 3. Eliminated subexpressions referred bu children expressions. + */ + def getExpressionInputParams( + ctx: CodegenContext, + expr: Expression): Option[(Seq[String], Seq[String])] = { +val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr) +val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars) + +val subExprs = getSubExprInChildren(ctx, expr) +val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs) + +val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr) +val paramsFromRows = inputRows.distinct.filter(_ != null).map { row => + (row, s"InternalRow $row") +} + +val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length +// Maximum allowed parameter number for Java's method descriptor. +if (paramsLength > 255) { + None +} else { + val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip + val callParams = allParams._1.distinct + val declParams = allParams._2.distinct + Some((callParams, declParams)) +} + } + + /** + * Returns the eliminated subexpressions in the children expressions. + */ + def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = { +expr.children.flatMap { child => + child.collect { +case e if ctx.subExprEliminationExprs.contains(e) => e + } +}.distinct + } + + /** + * Given the list of eliminated subexpressions used in the children expressions, returns the + * strings of funtion parameters. The first is the variable names used to call the function, + * the second is the parameters used to declare the function in generated code. 
+ */ + def getParamsForSubExprs( + ctx: CodegenContext, + subExprs: Seq[Expression]): Seq[(String, String)] = { +subExprs.flatMap { subExpr => + val argType = ctx.javaType(subExpr.dataType) + + val subExprState = ctx.subExprEliminationExprs(subExpr) + + if (!subExpr.nullable || subExprState.isNull == "true" || subExprState.isNull == "false") { +Seq((subExprState.value, s"$argType ${subExprState.value}")) + } else { +Seq((subExprState.value, s"$argType ${subExprState.value}"), + (subExprState.isNull, s"boolean ${subExprState.isNull}")) + } +}.distinct + } + + /** + * Retrieves previous input rows referred by children and deferred expressions. + */ + def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = { +expr.children.flatMap(getInputRows(ctx, _)).distinct + } + + /**
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154267761 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1028,12 +1053,18 @@ class CodegenContext { // 2. Less code. // Currently, we will do this for all non-leaf only expression trees (i.e. expr trees with // at least two nodes) as the cost of doing it is expected to be low. - addMutableState(JAVA_BOOLEAN, isNull, s"$isNull = false;") + if (expr.nullable) { +addMutableState(JAVA_BOOLEAN, isNull) + } addMutableState(javaType(expr.dataType), value, s"$value = ${defaultValue(expr.dataType)};") --- End diff -- Yeah, I think so.
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19823
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19823 Thanks! Merged to master/2.2
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154266910 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -1028,12 +1053,18 @@ class CodegenContext { // 2. Less code. // Currently, we will do this for all non-leaf only expression trees (i.e. expr trees with // at least two nodes) as the cost of doing it is expected to be low. - addMutableState(JAVA_BOOLEAN, isNull, s"$isNull = false;") + if (expr.nullable) { +addMutableState(JAVA_BOOLEAN, isNull) + } addMutableState(javaType(expr.dataType), value, s"$value = ${defaultValue(expr.dataType)};") --- End diff -- Is this initialization `s"$value = ${defaultValue(expr.dataType)};"` necessary? I think `$value` will always be referred to after `$value` is defined by the assignment generated at line 1039.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154266575 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala --- @@ -0,0 +1,248 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.codegen + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions._ + +/** + * Defines APIs used in expression code generation. + */ +object ExpressionCodegen { + + /** + * Given an expression, returns the all necessary parameters to evaluate it, so the generated + * code of this expression can be split in a function. + * The 1st string in returned tuple is the parameter strings used to call the function. + * The 2nd string in returned tuple is the parameter strings used to declare the function. + * + * Returns `None` if it can't produce valid parameters. + * + * Params to include: + * 1. Evaluated columns referred by this, children or deferred expressions. + * 2. Rows referred by this, children or deferred expressions. + * 3. Eliminated subexpressions referred bu children expressions. + */ + def getExpressionInputParams( + ctx: CodegenContext, + expr: Expression): Option[(Seq[String], Seq[String])] = { +val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr) +val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars) + +val subExprs = getSubExprInChildren(ctx, expr) +val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs) + +val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr) +val paramsFromRows = inputRows.distinct.filter(_ != null).map { row => + (row, s"InternalRow $row") +} + +val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length +// Maximum allowed parameter number for Java's method descriptor. +if (paramsLength > 255) { + None +} else { + val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip + val callParams = allParams._1.distinct + val declParams = allParams._2.distinct + Some((callParams, declParams)) +} + } + + /** + * Returns the eliminated subexpressions in the children expressions. + */ + def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = { +expr.children.flatMap { child => + child.collect { +case e if ctx.subExprEliminationExprs.contains(e) => e + } +}.distinct + } + + /** + * Given the list of eliminated subexpressions used in the children expressions, returns the + * strings of funtion parameters. The first is the variable names used to call the function, + * the second is the parameters used to declare the function in generated code. 
+ */ + def getParamsForSubExprs( + ctx: CodegenContext, + subExprs: Seq[Expression]): Seq[(String, String)] = { +subExprs.flatMap { subExpr => + val argType = ctx.javaType(subExpr.dataType) + + val subExprState = ctx.subExprEliminationExprs(subExpr) + + if (!subExpr.nullable || subExprState.isNull == "true" || subExprState.isNull == "false") { +Seq((subExprState.value, s"$argType ${subExprState.value}")) + } else { +Seq((subExprState.value, s"$argType ${subExprState.value}"), + (subExprState.isNull, s"boolean ${subExprState.isNull}")) + } +}.distinct + } + + /** + * Retrieves previous input rows referred by children and deferred expressions. + */ + def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = { +expr.children.flatMap(getInputRows(ctx, _)).distinct + } + + /** +
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266377 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala --- @@ -424,7 +426,9 @@ case class HiveScriptIOSchema ( recordReaderClass.map { klass => val instance = Utils.classForName(klass).newInstance().asInstanceOf[RecordReader] val props = new Properties() - props.putAll(outputSerdeProps.toMap.asJava) + // props.putAll(outputSerdeProps.toMap.asJava) + // see https://github.com/apache/kafka/pull/3647 --- End diff -- Ditto. I think it is better to remove the commented-out code and refer to https://github.com/scala/bug/issues/10418.
[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19845#discussion_r154266335 --- Diff: python/pyspark/ml/tests.py --- @@ -1837,6 +1837,29 @@ def test_read_images(self): self.assertEqual(ImageSchema.undefinedImageType, "Undefined") +class ImageReaderTest2(PySparkTestCase): + +@classmethod +def setUpClass(cls): +PySparkTestCase.setUpClass() +try: +cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf() +except py4j.protocol.Py4JError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +except TypeError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +cls.spark = HiveContext._createForTesting(cls.sc) + --- End diff -- Sure, that should be safer.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266308 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala --- @@ -412,7 +412,9 @@ case class HiveScriptIOSchema ( propsMap = propsMap + (serdeConstants.LIST_COLUMN_TYPES -> columnTypesNames) val properties = new Properties() -properties.putAll(propsMap.asJava) +// properties.putAll(propsMap.asJava) +// see https://github.com/apache/kafka/pull/3647 +propsMap.foreach{ case (k, v) => properties.put(k, v) } --- End diff -- And we should not leave commented-out code.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84364/ Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test FAILed.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266187 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -296,7 +296,8 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L props.put("replica.socket.timeout.ms", "1500") props.put("delete.topic.enable", "true") props.put("offsets.topic.num.partitions", "1") -props.putAll(withBrokerProps.asJava) +// props.putAll(withBrokerProps.asJava) +withBrokerProps.foreach{ case (k, v) => props.put(k, v) } --- End diff -- https://github.com/scala/bug/issues/10418. I see.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154266242 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala --- @@ -412,7 +412,9 @@ case class HiveScriptIOSchema ( propsMap = propsMap + (serdeConstants.LIST_COLUMN_TYPES -> columnTypesNames) val properties = new Properties() -properties.putAll(propsMap.asJava) +// properties.putAll(propsMap.asJava) +// see https://github.com/apache/kafka/pull/3647 +propsMap.foreach{ case (k, v) => properties.put(k, v) } --- End diff -- I think the direct reference should be https://github.com/scala/bug/issues/10418.
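For readers without the bug report handy: scala/bug#10418 describes how, when compiling against JDK 9+, `Properties` overrides `Hashtable`'s `putAll` with a slightly different generic signature, and scalac reports calls like `properties.putAll(propsMap.asJava)` as an ambiguous overloaded reference. Copying the entries one by one, as this PR does, sidesteps overload resolution entirely. A self-contained sketch of the pattern (object and method names here are illustrative):

```scala
import java.util.Properties

object PutAllWorkaroundExample {
  def toProperties(propsMap: Map[String, String]): Properties = {
    val properties = new Properties()
    // Avoids the JDK 9+ `putAll` overload ambiguity (scala/bug#10418) by
    // inserting each entry individually instead of converting the whole map.
    propsMap.foreach { case (k, v) => properties.put(k, v) }
    properties
  }
}
```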
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84364 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84364/testReport)** for PR 19811 at commit [`38fd079`](https://github.com/apache/spark/commit/38fd079c8257457eda741f808e0d00cedc6fdf02). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19854: SPARK-22660:Use position() and limit() to fix amb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19854#discussion_r154265720 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -296,7 +296,8 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L props.put("replica.socket.timeout.ms", "1500") props.put("delete.topic.enable", "true") props.put("offsets.topic.num.partitions", "1") -props.putAll(withBrokerProps.asJava) +// props.putAll(withBrokerProps.asJava) +withBrokerProps.foreach{ case (k, v) => props.put(k, v) } --- End diff -- Is this related?
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user sujith71955 commented on the issue: https://github.com/apache/spark/pull/19823 Thanks all for the review and guidance.
[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19845 @HyukjinKwon Thanks. I forgot that Hive support is needed to test it. The added test looks good.
[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19845#discussion_r154262931 --- Diff: python/pyspark/ml/tests.py --- @@ -1837,6 +1837,29 @@ def test_read_images(self): self.assertEqual(ImageSchema.undefinedImageType, "Undefined") +class ImageReaderTest2(PySparkTestCase): + +@classmethod +def setUpClass(cls): +PySparkTestCase.setUpClass() +try: +cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf() +except py4j.protocol.Py4JError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +except TypeError: +cls.tearDownClass() +raise unittest.SkipTest("Hive is not available") +cls.spark = HiveContext._createForTesting(cls.sc) + --- End diff -- Add a `tearDownClass` classmethod to stop `cls.spark`? I don't see `HiveContextSQLTests` closing it either; maybe we can fix that too.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054 **[Test build #84368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84368/testReport)** for PR 19054 at commit [`69e288e`](https://github.com/apache/spark/commit/69e288efda8dbaab667f0e9ed6c7f2cb811f79b1).
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Merged build finished. Test PASSed.
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19631 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84363/ Test PASSed.
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19631 **[Test build #84363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84363/testReport)** for PR 19631 at commit [`c752453`](https://github.com/apache/spark/commit/c752453b2a379f301d52692bfc639bb631520069). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84367/testReport)** for PR 19813 at commit [`0d358d6`](https://github.com/apache/spark/commit/0d358d635494199582aa6e38fdbeec0f6446c029).
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 @mgaido91 Thanks for the review.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154261964

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns all the necessary parameters to evaluate it, so the generated
+   * code of this expression can be split into a function.
+   * The first seq in the returned tuple contains the parameter strings used to call the
+   * function; the second contains the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred to by this, children or deferred expressions.
+   * 2. Rows referred to by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred to by children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)

--- End diff --

Moved.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Merged build finished. Test PASSed.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84362/
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054

**[Test build #84366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84366/testReport)** for PR 19054 at commit [`b0db6aa`](https://github.com/apache/spark/commit/b0db6aafd3cb172062220aeeae087be9ad5b99e5).

* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84366/
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19848

**[Test build #84362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84362/testReport)** for PR 19848 at commit [`92f9180`](https://github.com/apache/spark/commit/92f9180b0fea71ff3ae4aa3049e04d6f1e3167be).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19054 Merged build finished. Test FAILed.
[GitHub] spark issue #19054: [SPARK-18067] Avoid shuffling child if join keys are sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19054 **[Test build #84366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84366/testReport)** for PR 19054 at commit [`b0db6aa`](https://github.com/apache/spark/commit/b0db6aafd3cb172062220aeeae087be9ad5b99e5).
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 @ueshin Cluster mode is working, but client mode is not.
[GitHub] spark pull request #19850: [SPARK-22653] executorAddress registered in Coars...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19850
[GitHub] spark issue #19850: [SPARK-22653] executorAddress registered in CoarseGraine...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19850 good catch! merging to master!
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154257537

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -115,9 +116,35 @@ abstract class Expression extends TreeNode[Expression] {
     }
   }
 
+  /**
+   * Records the current input row and variables for this expression into the created `ExprCode`.
+   */
+  private def populateInputs(ctx: CodegenContext, eval: ExprCode): Unit = {
+    if (ctx.INPUT_ROW != null) {
+      eval.inputRow = ctx.INPUT_ROW
+    }
+    if (ctx.currentVars != null) {
+      val boundRefs = this.collect {
+        case b @ BoundReference(ordinal, _, _) if ctx.currentVars(ordinal) != null => (ordinal, b)
+      }.toMap
+
+      ctx.currentVars.zipWithIndex.filter(_._1 != null).foreach { case (currentVar, idx) =>

--- End diff --

ctx.INPUT_ROW can possibly be null.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154257472

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns all the necessary parameters to evaluate it, so the generated
+   * code of this expression can be split into a function.
+   * The first seq in the returned tuple contains the parameter strings used to call the
+   * function; the second contains the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred to by this, children or deferred expressions.
+   * 2. Rows referred to by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred to by children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val subExprs = getSubExprInChildren(ctx, expr)
+
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+    val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs)
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * Given the list of eliminated subexpressions used in the children expressions, returns the
+   * strings of function parameters. The first is the variable names used to call the function,
+   * the second is the parameters used to declare the function in generated code.
+   */
+  def getParamsForSubExprs(
+      ctx: CodegenContext,
+      subExprs: Seq[Expression]): Seq[(String, String)] = {
+    subExprs.flatMap { subExpr =>
+      val argType = ctx.javaType(subExpr.dataType)
+
+      val subExprState = ctx.subExprEliminationExprs(subExpr)
+      (subExprState.value, subExprState.isNull)
+
+      if (!subExpr.nullable || subExprState.isNull == "true" || subExprState.isNull == "false") {
+        Seq((subExprState.value, s"$argType ${subExprState.value}"))
+      } else {
+        Seq((subExprState.value, s"$argType ${subExprState.value}"),
+          (subExprState.isNull, s"boolean ${subExprState.isNull}"))
+      }
+    }.distinct
+  }
+
+  /**
+   * Retrieves previous input rows referred to by children and deferred expressions.
+   */
+  def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = {
+
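For context on the `paramsLength > 255` check in the hunk above: the JVM caps a method descriptor at 255 parameter slots, and `long`/`double` arguments occupy two slots each. A minimal sketch of that slot arithmetic (the helper name and standalone setting are illustrative, not Spark's actual `getParamLength`):

```scala
object ParamSlotSketch {
  // Count JVM parameter slots for a list of Java type names; long and
  // double take two slots each, everything else (including references) one.
  def paramSlots(javaTypes: Seq[String]): Int =
    javaTypes.map(t => if (t == "long" || t == "double") 2 else 1).sum

  def main(args: Array[String]): Unit = {
    val types = Seq("InternalRow", "long", "boolean", "double")
    // 1 + 2 + 1 + 2 = 6 slots; splitting is abandoned once the total exceeds 255.
    println(paramSlots(types))
  }
}
```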
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r154256590

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -152,3 +153,62 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints.
+ *
+ * The optimization is applicable only to CROSS joins. For other join types, adding inferred join

--- End diff --

Can we apply this optimization to all joins after https://github.com/apache/spark/pull/19054?
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84365/testReport)** for PR 19783 at commit [`6e6c49b`](https://github.com/apache/spark/commit/6e6c49bdfef0071f18b6e9b607b7fc1bf5b49ff8).
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user rezasafi commented on a diff in the pull request: https://github.com/apache/spark/pull/19848#discussion_r154256205

--- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala ---
@@ -70,7 +70,8 @@ object SparkHadoopMapRedUtil extends Logging {
     if (shouldCoordinateWithDriver) {
       val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator
       val taskAttemptNumber = TaskContext.get().attemptNumber()
-      val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, taskAttemptNumber)
+      val stageId = TaskContext.get().stageId()
+      val canCommit = outputCommitCoordinator.canCommit(stageId, splitId, taskAttemptNumber)

--- End diff --

Removing jobId from the signature of commitTask would cause a binary-incompatibility error, since commitTask is a public method. So we would end up with a parameter that stays unused.
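The shape of the compatibility concern can be sketched as follows (names and bodies are illustrative stand-ins, not Spark's actual code): the public method keeps its original parameter list so already-compiled callers still link, and the now-unneeded argument is simply ignored.

```scala
object CommitSketch {
  // Hypothetical stand-ins for TaskContext.get().stageId() and the
  // output commit coordinator; only the signatures matter here.
  private def currentStageId(): Int = 0
  private def canCommit(stageId: Int, splitId: Int, attempt: Int): Boolean = true

  // The public signature still accepts jobId so the change stays
  // binary-compatible, but coordination is now keyed on stageId.
  def commitTask(jobId: Int, splitId: Int, attemptNumber: Int): Boolean = {
    val stageId = currentStageId() // jobId is intentionally unused
    canCommit(stageId, splitId, attemptNumber)
  }
}
```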
[GitHub] spark issue #19850: [SPARK-22653] executorAddress registered in CoarseGraine...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19850

Please update the PR title:
```
[SPARK-22653][CORE] executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap should not be null
```
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154255068

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 5)
   }
 
+  // The following test cases have histogram information collected for the test column
+  test("Not(cintHgm < 3 AND null)") {
+    val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)),
+      expectedRowCount = 9)
+  }
+
+  test("cintHgm = 5") {
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm = 0") {
+    // This is an out-of-range case since 0 is outside the range [min, max]
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm < 3") {
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm < 0") {
+    // This is a corner case since literal 0 is smaller than min.
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm <= 3") {
+    validateEstimatedStats(
+      Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 6") {
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 10") {
+    // This is a corner case since max value is 10.
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm >= 6") {
+    validateEstimatedStats(
+      Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm IS NULL") {
+    validateEstimatedStats(
+      Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm IS NOT NULL") {
+    validateEstimatedStats(
+      Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 10)
+  }
+
+  test("cintHgm > 3 AND cintHgm <= 6") {
+    val condition = And(GreaterThan(attrIntHgm,
+      Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 8)
+  }
+
+  test("cintHgm = 3 OR cintHgm = 6") {

--- End diff --

We have added histogram test cases for a skewed distribution. I will add more histogram test cases for a non-skewed distribution.
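For readers following along, the intuition behind these expected row counts can be sketched with a toy equi-height histogram. This is illustrative only; Spark's real logic lives in FilterEstimation and also handles nulls, ndv, and bound updates as discussed elsewhere in this thread.

```scala
object HistogramSketch {
  // One equi-height bin: each bin covers roughly the same number of rows.
  final case class Bin(lo: Double, hi: Double)

  // Estimated selectivity of "col < v": full bins entirely below v, plus a
  // linear fraction of the bin that straddles v, over the total bin count.
  def lessThanSelectivity(bins: Seq[Bin], v: Double): Double = {
    val full = bins.count(_.hi < v)
    val partial = bins.find(b => b.lo <= v && v <= b.hi)
      .map(b => if (b.hi == b.lo) 0.0 else (v - b.lo) / (b.hi - b.lo))
      .getOrElse(0.0)
    (full + partial) / bins.length
  }

  def main(args: Array[String]): Unit = {
    val bins = Seq(Bin(1, 2), Bin(2, 3), Bin(3, 6), Bin(6, 10))
    println(lessThanSelectivity(bins, 3.0)) // 0.5 with these toy bins
  }
}
```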
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254419

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),

--- End diff --

My earlier comment mentioned this test case.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254145

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt)
     : AttributeMap[ColumnStat] = {
     val newColumnStats = originalMap.map { case (attr, oriColStat) =>
-      // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows
-      // decreases; otherwise keep it unchanged.
-      val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter,
-        newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount)
       val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat)
+      val newNdv = if (colStat.distinctCount > 1) {

--- End diff --

The old code does not work well for a couple of the new skewed-distribution tests. For example, test("cintHgm < 3") would fail: it still recomputes newNdv via the updateNdv() method, even though the histogram-based estimation has already scaled the distinct count down to 1.
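The scaling being discussed can be summarized in a small sketch (illustrative; updateNdv in EstimationUtils is the real implementation): the distinct count shrinks proportionally with the filter's selectivity but never below 1, and a histogram-based estimate may pin it down directly, which is why re-running the generic scaling afterwards would be wrong.

```scala
object NdvSketch {
  // Scale oldNdv down with the row-count reduction, never below 1 and
  // never above the original value (a filter cannot create new values).
  def scaledNdv(rowsBefore: BigInt, rowsAfter: BigInt, oldNdv: BigInt): BigInt =
    if (rowsBefore <= 0 || rowsAfter >= rowsBefore) oldNdv
    else (oldNdv * rowsAfter / rowsBefore).max(BigInt(1))
}
```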
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84364 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84364/testReport)** for PR 19811 at commit [`38fd079`](https://github.com/apache/spark/commit/38fd079c8257457eda741f808e0d00cedc6fdf02).
[GitHub] spark issue #19824: [SPARK][STREAMING] Invoke onBatchCompletion() only when ...
Github user victor-wong commented on the issue: https://github.com/apache/spark/pull/19824

@CodingCat Yes, this PR addresses the same issue as https://github.com/apache/spark/pull/16542, but I think it solves it in a better way. If a job fails, we should not remove it from its JobSet, so `jobSet.hasCompleted` will return false and, as a result, no StreamingListenerBatchCompleted event will be posted. In short: if any job fails, the batch should be considered not completed. I'm not confident in my English, so if anything above is unclear, please let me know.
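A minimal sketch of the proposed semantics (illustrative, not the actual JobSet class): only successful jobs count toward completion, so a batch with a failed job never reports itself as completed and no completion event is posted.

```scala
object JobSetSketch {
  final class JobSet(totalJobs: Int) {
    private var succeeded = 0
    // Called once per job; failed jobs are deliberately not counted as done.
    def handleJobCompletion(success: Boolean): Unit =
      if (success) succeeded += 1
    def hasCompleted: Boolean = succeeded == totalJobs
  }

  def main(args: Array[String]): Unit = {
    val jobSet = new JobSet(totalJobs = 2)
    jobSet.handleJobCompletion(success = true)
    jobSet.handleJobCompletion(success = false)
    println(jobSet.hasCompleted) // false: no BatchCompleted event would fire
  }
}
```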
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154252063

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging {
     op match {
       case _: GreaterThan | _: GreaterThanOrEqual =>
-        // If new ndv is 1, then new max must be equal to new min.
-        newMin = if (newNdv == 1) newMax else newValue
+        newMin = newValue
       case _: LessThan | _: LessThanOrEqual =>
-        newMax = if (newNdv == 1) newMin else newValue
+        newMax = newValue

--- End diff --

Previously I coded it that way because of a corner test case, test("cbool > false"): there I set newMin to newMax since newNdv = 1. However, that logic does not work well for the skewed-distribution test case test("cintHgm < 3"), where newMin = 1 and newMax = 3. I think the revised code makes better sense.
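The revised rule can be stated compactly (an illustrative sketch, not the actual FilterEstimation code): for a range predicate, only the bound on the filtered side moves to the literal, and the opposite bound is left alone even when the new ndv collapses to 1.

```scala
object BoundSketch {
  sealed trait RangeOp
  case object GreaterThanish extends RangeOp // covers > and >=
  case object LessThanish extends RangeOp    // covers < and <=

  // Returns (newMin, newMax) after applying "col op value".
  def updateBounds(op: RangeOp, oldMin: Double, oldMax: Double, value: Double): (Double, Double) =
    op match {
      case GreaterThanish => (value, oldMax) // newMin = value; max unchanged
      case LessThanish    => (oldMin, value) // newMax = value; min unchanged
    }

  def main(args: Array[String]): Unit = {
    // "cintHgm < 3" over [1, 10] yields bounds (1.0, 3.0), as in the test above.
    println(updateBounds(LessThanish, 1, 10, 3))
  }
}
```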