[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/15807 For wholestage codegen, I think that the lifetime of sub-expressions is within a single row iteration. Thus, `isInitialized` and `subExpr1` should be re-initialized at the beginning of each iteration, e.g. right after reading a row. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
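The per-iteration reset kiszk describes can be sketched as a toy Java class (all names invented, not Spark's actual generated code): the cached flag is cleared right after each row is read, so the lazily computed subexpression value never leaks across rows.

```java
public class PerRowSubExprSketch {
    private boolean subExprIsInitialized = false;
    private int subExpr1 = 0;

    // stand-in for an expensive common subexpression over the current row
    private int evalSubExpr1(int input) {
        return input * input;
    }

    // lazy accessor: evaluates at most once per row
    private int getSubExpr1(int input) {
        if (!subExprIsInitialized) {
            subExpr1 = evalSubExpr1(input);
            subExprIsInitialized = true;
        }
        return subExpr1;
    }

    public int processRow(int input) {
        // reset at the beginning of each iteration, right after reading a row,
        // so the next row never sees a stale cached value
        subExprIsInitialized = false;
        // the subexpression is referenced twice but evaluated once
        return getSubExpr1(input) + getSubExpr1(input);
    }
}
```

Without the reset at the top of `processRow`, the second row would silently reuse the first row's cached value.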
[GitHub] spark issue #15822: [Minor][PySpark] Improve error message when running PySp...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15822 yeah, as per the discussion at https://github.com/apache/spark/pull/15659#issuecomment-259157516.
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13909 **[Test build #68396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68396/consoleFull)** for PR 13909 at commit [`990d6c8`](https://github.com/apache/spark/commit/990d6c8bb7f159088f1f614bf80d91a5801616cf).
[GitHub] spark issue #15822: [Minor][PySpark] Improve error message when running PySp...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15822 Thanks - did this come from a discussion somewhere?
[GitHub] spark pull request #15816: SPARK-18368: Fix regexp_replace with task seriali...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15816
[GitHub] spark issue #15816: SPARK-18368: Fix regexp_replace with task serialization.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15816 I'm surprised too that we haven't caught this earlier ...
[GitHub] spark issue #15816: SPARK-18368: Fix regexp_replace with task serialization.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15816 Merging in master/branch-2.1/branch-2.0.
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15823 Merged build finished. Test PASSed.
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15823 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68391/
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15823 **[Test build #68391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68391/consoleFull)** for PR 15823 at commit [`4e79c37`](https://github.com/apache/spark/commit/4e79c377f99602adb5c3fd82f57de1773bb9b64d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15627: [SPARK-18099][YARN] Fail if same files added to distribu...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/15627 @kishorvpatil Thank you for fixing this!
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15807 Even if we modify it to hold the results of subexpressions in member variables, the above code example would still not work under wholestage codegen. The above code example is in fact similar to subexpression elimination in non-wholestage codegen, which also uses functions to wrap subexpression evaluations. But in wholestage codegen we might evaluate expressions against an input row or local variables, so the function approach can't work because of those local variables.
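The local-variable problem viirya describes can be sketched with a toy Java example (all names invented, not Spark's generated code): when the subexpression reads a variable that exists only inside the per-row loop body, it cannot be hoisted into a separate method without threading that variable through as a parameter.

```java
public class LocalVarProblemSketch {
    public int processRows(int[] rows) {
        int sum = 0;
        for (int row : rows) {
            // `localValue` exists only inside this loop body. A separate
            // method such as a hypothetical `evalSubExpr()` could not read
            // it; the subexpression has to be evaluated inline (or the
            // local would have to be passed explicitly as a parameter).
            int localValue = row + 1;
            int subExpr = localValue * localValue; // evaluated inline
            sum += subExpr + subExpr;              // reused twice per row
        }
        return sum;
    }
}
```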
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15807 Why can't whole stage codegen use member variables to keep the results of subexpressions?
[GitHub] spark pull request #15799: [SPARK-18333] [SQL] Revert hacks in parquet and o...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15799
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15823 LGTM. Not at my laptop, would be great if you can merge @rxin, thanks.
[GitHub] spark issue #15799: [SPARK-18333] [SQL] Revert hacks in parquet and orc read...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15799 merging to master/2.1
[GitHub] spark issue #15825: [SPARK-18377][SQL] warehouse path should be a static con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15825 **[Test build #68395 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68395/consoleFull)** for PR 15825 at commit [`1200e2d`](https://github.com/apache/spark/commit/1200e2d47cbb0b6d67950e2ad3512798395d65cf).
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15807
> isn't the result of subexpression kept in member variables?

For non-wholestage codegen, yes. For wholestage codegen, no.
[GitHub] spark pull request #15797: [SPARK-17990][SPARK-18302][SQL] correct several p...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15797#discussion_r87141007

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
```
@@ -810,13 +825,44 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       newSpecs: Seq[TablePartitionSpec]): Unit = withClient {
     client.renamePartitions(
       db, table, specs.map(lowerCasePartitionSpec), newSpecs.map(lowerCasePartitionSpec))
+
+    val tableMeta = getTable(db, table)
+    val partitionColumnNames = tableMeta.partitionColumnNames
+    // Hive metastore is not case preserving and keeps partition columns with lower cased names.
+    // When Hive rename partition for managed tables, it will create the partition location with
+    // a default path generate by the new spec with lower cased partition column names. This is
+    // unexpected and we need to rename them manually and alter the partition location.
+    val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
+    if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
+      val tablePath = new Path(tableMeta.location)
+      val fs = tablePath.getFileSystem(hadoopConf)
+      val newParts = newSpecs.map { spec =>
+        val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
+        val wrongPath = new Path(partition.storage.locationUri.get)
+        val rightPath = ExternalCatalogUtils.generatePartitionPath(
+          spec, partitionColumnNames, tablePath)
+        try {
+          fs.rename(wrongPath, rightPath)
+        } catch {
+          case e: IOException =>
+            throw new SparkException(s"Unable to rename partition path $wrongPath", e)
```
--- End diff --

Maybe we also need to add `rightPath` here
[GitHub] spark issue #15799: [SPARK-18333] [SQL] Revert hacks in parquet and orc read...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15799 LGTM
[GitHub] spark issue #15233: [SPARK-17659] [SQL] Partitioned View is Not Supported By...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15233 LGTM
[GitHub] spark issue #15825: [SPARK-18377][SQL] warehouse path should be a static con...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15825 retest this please
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15807 Isn't the result of the subexpression kept in member variables? What I am talking about is something like:
```
private boolean isInitialized = false;
private int subExpr1 = 0;

private void evalSubExpr1() {
  ...
  isInitialized = true;
}

public int getSubExpr1() {
  if (!isInitialized) {
    evalSubExpr1();
  }
  return subExpr1;
}
```
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15807 E.g.,
```
if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)            // subexpr2, first used.
  SomeExpr(AssertNotNull(subexpr))  // SomeExpr(subexpr2)
}

if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)             // subexpr2
  SomeExpr2(AssertNotNull(subexpr))  // SomeExpr2(subexpr2)
}

SomeExpr3(AssertNotNull(subexpr))    // SomeExpr3(subexpr2)
```
`AssertNotNull(subexpr)` is recognized as a subexpression. When it is first used in the else branch at the top, it is evaluated inside that branch. But `SomeExpr3(AssertNotNull(subexpr))`, which also uses it, can't access the evaluated subexpression result.
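The scoping issue can be illustrated with a toy Java example (all names hypothetical, standing in for the generated expressions): a value first computed inside an else branch is not visible after the branch, so a later use has to evaluate the subexpression again.

```java
public class BranchScopeSketch {
    static int evalCount = 0;

    // stand-in for AssertNotNull(subexpr); counts how often it runs
    static int assertNotNull(Integer x) {
        evalCount++;
        return x;
    }

    public static int demo(Integer subexpr) {
        int result = 0;
        if (subexpr == null) {
            result = -1;
        } else {
            // subExpr2 is first evaluated here, inside the branch...
            int subExpr2 = assertNotNull(subexpr);
            result = subExpr2 + 1;               // SomeExpr(subExpr2)
        }
        // ...but it is out of scope here, so the later use must
        // evaluate the subexpression a second time.
        result += assertNotNull(subexpr) * 2;    // SomeExpr3(subexpr2)
        return result;
    }
}
```

The counter shows the duplicate evaluation that subexpression elimination was supposed to avoid.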
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15807 > we can't access the subexpression outside later

I don't quite understand it, can you give an example?
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15824 **[Test build #68394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68394/consoleFull)** for PR 15824 at commit [`548e45f`](https://github.com/apache/spark/commit/548e45f63f64a3f9e9709af39610ce50390b0fa1).
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15824 retest this please.
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15824 Merged build finished. Test FAILed.
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15824 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68393/
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15824 **[Test build #68393 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68393/consoleFull)** for PR 15824 at commit [`548e45f`](https://github.com/apache/spark/commit/548e45f63f64a3f9e9709af39610ce50390b0fa1).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87138232

--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
```
@@ -87,25 +120,40 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String)

   override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
     committer.commitJob(jobContext)
+    val filesToMove = taskCommits.map(_.obj.asInstanceOf[Map[String, String]]).reduce(_ ++ _)
+    logDebug(s"Committing files staged for absolute locations $filesToMove")
+    val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
+    for ((src, dst) <- filesToMove) {
+      fs.rename(new Path(src), new Path(dst))
+    }
+    fs.delete(absPathStagingDir, true)
   }

   override def abortJob(jobContext: JobContext): Unit = {
     committer.abortJob(jobContext, JobStatus.State.FAILED)
+    val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
+    fs.delete(absPathStagingDir, true)
   }

   override def setupTask(taskContext: TaskAttemptContext): Unit = {
     committer = setupCommitter(taskContext)
     committer.setupTask(taskContext)
+    addedAbsPathFiles = mutable.Map[String, String]()
   }

   override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
     val attemptId = taskContext.getTaskAttemptID
     SparkHadoopMapRedUtil.commitTask(
       committer, taskContext, attemptId.getJobID.getId, attemptId.getTaskID.getId)
-    EmptyTaskCommitMessage
+    new TaskCommitMessage(addedAbsPathFiles.toMap)
```
--- End diff --

Yea it can go either way. Unclear which one is better. Renaming on job commit gives higher chance of corrupting data, whereas renaming in task commit is slightly more performant.
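The tradeoff under discussion can be modeled with a toy commit protocol in Java (a sketch with invented names, not Spark's real `HadoopMapReduceCommitProtocol`; a plain map stands in for the filesystem): each task stages its output and reports the intended src-to-dst moves in its commit message, and all renames happen only at job commit, so final locations stay untouched until the whole job succeeds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyCommitProtocol {
    // stands in for the filesystem: path -> contents
    final Map<String, String> fs = new HashMap<>();
    // one "TaskCommitMessage" per task: staging path -> final path
    final List<Map<String, String>> taskCommits = new ArrayList<>();

    // a task writes to a staging path and records the intended final path
    public void commitTask(String stagingPath, String finalPath, String data) {
        fs.put(stagingPath, data);
        Map<String, String> filesToMove = new HashMap<>();
        filesToMove.put(stagingPath, finalPath);
        taskCommits.add(filesToMove);
    }

    // the driver performs every rename in one place at job commit
    public void commitJob() {
        for (Map<String, String> moves : taskCommits) {
            for (Map.Entry<String, String> e : moves.entrySet()) {
                fs.put(e.getValue(), fs.remove(e.getKey()));
            }
        }
    }
}
```

Renaming in `commitTask` instead would avoid the driver-side loop, but then a job that fails after some tasks committed has already written to final locations, which is the corruption risk mentioned above.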
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87138139

--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
```
@@ -87,25 +120,40 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String)

   override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
     committer.commitJob(jobContext)
+    val filesToMove = taskCommits.map(_.obj.asInstanceOf[Map[String, String]]).reduce(_ ++ _)
+    logDebug(s"Committing files staged for absolute locations $filesToMove")
+    val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
+    for ((src, dst) <- filesToMove) {
+      fs.rename(new Path(src), new Path(dst))
+    }
+    fs.delete(absPathStagingDir, true)
   }

   override def abortJob(jobContext: JobContext): Unit = {
     committer.abortJob(jobContext, JobStatus.State.FAILED)
+    val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
+    fs.delete(absPathStagingDir, true)
   }

   override def setupTask(taskContext: TaskAttemptContext): Unit = {
     committer = setupCommitter(taskContext)
     committer.setupTask(taskContext)
+    addedAbsPathFiles = mutable.Map[String, String]()
   }

   override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
     val attemptId = taskContext.getTaskAttemptID
     SparkHadoopMapRedUtil.commitTask(
       committer, taskContext, attemptId.getJobID.getId, attemptId.getTaskID.getId)
-    EmptyTaskCommitMessage
+    new TaskCommitMessage(addedAbsPathFiles.toMap)
```
--- End diff --

Why don't we just rename temp files to dest files in commitTask?
[GitHub] spark issue #15803: [SPARK-18298][Web UI]change gmt time to local zone time ...
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/spark/pull/15803 I agree with showing the timezone with the date string. But always using GMT/UTC time is not a good choice: application logs (via log4j) are usually printed in the local timezone (like CST). That means if I want to check what happened by comparing application logs with the EventLog, I must translate between the two timezones manually.
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15807 @cloud-fan Then once the first expression to use the subexpression is in an if/else branch, we can't access the subexpression outside that branch later. Evaluate it again?
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15824 **[Test build #68393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68393/consoleFull)** for PR 15824 at commit [`548e45f`](https://github.com/apache/spark/commit/548e45f63f64a3f9e9709af39610ce50390b0fa1).
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15824 retest this please.
[GitHub] spark issue #15825: [SPARK-18377][SQL] warehouse path should be a static con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15825 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68390/ Test FAILed.
[GitHub] spark issue #15825: [SPARK-18377][SQL] warehouse path should be a static con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15825 Merged build finished. Test FAILed.
[GitHub] spark issue #15825: [SPARK-18377][SQL] warehouse path should be a static con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15825 **[Test build #68390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68390/consoleFull)** for PR 15825 at commit [`1200e2d`](https://github.com/apache/spark/commit/1200e2d47cbb0b6d67950e2ad3512798395d65cf).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15814 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68388/ Test FAILed.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15814 Merged build finished. Test FAILed.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15814 **[Test build #68388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68388/consoleFull)** for PR 15814 at commit [`91f87de`](https://github.com/apache/spark/commit/91f87de270bbd743f0b40b82016391706a936ca1).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15814 Merged build finished. Test FAILed.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15814 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68387/ Test FAILed.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15814 **[Test build #68387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68387/consoleFull)** for PR 15814 at commit [`4296612`](https://github.com/apache/spark/commit/4296612bd373bc1118b2cbfc10530b1895aaef60).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15748: [SPARK-18240][ML][PySpark] Add Summary of BiKMean...
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/15748
[GitHub] spark issue #15233: [SPARK-17659] [SQL] Partitioned View is Not Supported By...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15233 Merged build finished. Test PASSed.
[GitHub] spark issue #15233: [SPARK-17659] [SQL] Partitioned View is Not Supported By...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15233 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68386/ Test PASSed.
[GitHub] spark issue #15233: [SPARK-17659] [SQL] Partitioned View is Not Supported By...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15233 **[Test build #68386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68386/consoleFull)** for PR 15233 at commit [`c6d3acd`](https://github.com/apache/spark/commit/c6d3acd953ad70f6d66252d13e6cf4c0e507d33c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87135950

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -226,6 +238,34 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] {
     insertCmd
   }
+
+  /**
+   * Given a set of input partitions, returns those that have locations that differ from the
+   * Hive default (e.g. /k1=v1/k2=v2). These partitions were manually assigned locations by
+   * the user.
+   *
+   * @return a mapping from partition specs to their custom locations
+   */
+  private def getCustomPartitionLocations(
+      spark: SparkSession,
+      table: CatalogTable,
+      basePath: Path,
+      partitions: Seq[CatalogTablePartition]): Map[TablePartitionSpec, String] = {
+    val hadoopConf = spark.sessionState.newHadoopConf
+    val fs = basePath.getFileSystem(hadoopConf)
+    val qualifiedBasePath = basePath.makeQualified(fs.getUri, fs.getWorkingDirectory)
+    partitions.flatMap { p =>
+      val defaultLocation = qualifiedBasePath.suffix(
+        "/" + PartitioningUtils.getPathFragment(p.spec, table.partitionSchema)).toString
+      val catalogLocation = new Path(p.storage.locationUri.get).makeQualified(
+        fs.getUri, fs.getWorkingDirectory).toString
+      if (catalogLocation != defaultLocation) {
--- End diff --

Why do we distinguish partition locations that equal the default location? Partitions always have locations (custom-specified or generated by default); do we really need to care about who set them?
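For intuition, the default-location comparison in the quoted diff can be sketched independently of Hadoop's `Path` API: a partition counts as "custom" exactly when its catalog location differs from the Hive-default path `<base>/k1=v1/k2=v2` derived from its spec. This is an illustration only (the flattened `String[][]` spec and both helpers are hypothetical):

```java
// Sketch of the custom-partition-location check from the quoted diff.
public class CustomPartitionLocations {
    // Builds the Hive-default location: basePath + "/k1=v1/k2=v2/..."
    // spec is a list of {key, value} pairs in partition-column order.
    static String defaultLocation(String basePath, String[][] spec) {
        StringBuilder sb = new StringBuilder(basePath);
        for (String[] kv : spec) {
            sb.append('/').append(kv[0]).append('=').append(kv[1]);
        }
        return sb.toString();
    }

    // A partition is "custom" when its catalog location does not match
    // the default location derived from its spec.
    static boolean isCustom(String basePath, String[][] spec, String catalogLocation) {
        return !catalogLocation.equals(defaultLocation(basePath, spec));
    }
}
```

cloud-fan's question then amounts to: since every partition yields *some* location either way, is the `isCustom` distinction needed at all, or could callers always use the catalog location?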
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15823 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68384/ Test PASSed.
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15823 Merged build finished. Test PASSed.
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15823 **[Test build #68384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68384/consoleFull)** for PR 15823 at commit [`51d1fbe`](https://github.com/apache/spark/commit/51d1fbe464fa49f4503435f0f10790d8f1ec35ad).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15814 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68385/ Test FAILed.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15814 Merged build finished. Test FAILed.
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15814 **[Test build #68385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68385/consoleFull)** for PR 15814 at commit [`fbd7b42`](https://github.com/apache/spark/commit/fbd7b425290644fcfdcaf92491aaae2ac67eefb9).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15797: [SPARK-17990][SPARK-18302][SQL] correct several partitio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15797 **[Test build #68392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68392/consoleFull)** for PR 15797 at commit [`9dbc3f1`](https://github.com/apache/spark/commit/9dbc3f12f85c376c813d0eb2ea12854e9afe3853).
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15823 **[Test build #68391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68391/consoleFull)** for PR 15823 at commit [`4e79c37`](https://github.com/apache/spark/commit/4e79c377f99602adb5c3fd82f57de1773bb9b64d).
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15824 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68389/ Test FAILed.
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15824 Merged build finished. Test FAILed.
[GitHub] spark issue #15824: [SPARK-18376][SQL] Skip subexpression elimination for co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15824 **[Test build #68389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68389/consoleFull)** for PR 15824 at commit [`548e45f`](https://github.com/apache/spark/commit/548e45f63f64a3f9e9709af39610ce50390b0fa1).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15819: [SPARK-18372][SQL].Staging directory fail to be removed
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15819 Can you add some documentation? The current code is very difficult to follow.
[GitHub] spark issue #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputF...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15823 LGTM otherwise.
[GitHub] spark pull request #15823: [SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15823#discussion_r87134013

--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
@@ -42,7 +43,13 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String)
   @transient private var committer: OutputCommitter = _

   protected def setupCommitter(context: TaskAttemptContext): OutputCommitter = {
-    context.getOutputFormatClass.newInstance().getOutputCommitter(context)
+    val format = context.getOutputFormatClass.newInstance
--- End diff --

use `newInstance()` since it clearly has side effects.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 @cloud-fan Yes, we could take the same approach as #15044. I have just implemented it in my local environment, and it achieves a similar performance improvement. I will submit that approach to #13909 within the next few hours, then cc you and @hvanhovell.
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15807 Can we just evaluate subexpressions like a Scala lazy val?
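The lazy-val idea maps naturally onto the Java that whole-stage codegen emits: a flag plus a cached slot, with the flag cleared for each input row so the cache never leaks across rows. A hand-written sketch only (the names `isInitialized` and `subExpr1` follow the earlier comment; everything else is an assumption, not Spark's actual generated code):

```java
// Sketch of lazy ("lazy val"-style) subexpression evaluation.
public class LazySubexprSketch {
    private boolean isInitialized = false;
    private int subExpr1 = 0;   // cached value of the common subexpression
    private int evalCount = 0;  // for demonstration only

    // Called after reading each new input row.
    void resetForNewRow() { isInitialized = false; }

    // Every use site calls this instead of inlining the expression, so
    // whichever branch runs first fills the shared top-level slot.
    int subExpr1Value(int input) {
        if (!isInitialized) {
            subExpr1 = expensiveComputation(input);
            isInitialized = true;
        }
        return subExpr1;
    }

    private int expensiveComputation(int input) {
        evalCount++;
        return input * input; // stand-in for the real subexpression
    }

    int evaluations() { return evalCount; }
}
```

Because both the flag and the slot live at the top level of the class, the first use inside an if/else branch and any later use outside it share the same cache, which addresses the scoping concern raised earlier in the thread.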
[GitHub] spark issue #15825: [SPARK-18377][SQL] warehouse path should be a static con...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15825 CC @yhuai @rxin @srown @gatorsmile
[GitHub] spark issue #15825: [SPARK-18377][SQL] warehouse path should be a static con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15825 **[Test build #68390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68390/consoleFull)** for PR 15825 at commit [`1200e2d`](https://github.com/apache/spark/commit/1200e2d47cbb0b6d67950e2ad3512798395d65cf).
[GitHub] spark pull request #15825: [SPARK-18377][SQL] warehouse path should be a sta...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/15825

[SPARK-18377][SQL] warehouse path should be a static conf

## What changes were proposed in this pull request?

It's weird that every session can set its own warehouse path at runtime; we should forbid it and make it a static conf.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark warehouse

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15825.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15825

commit 1200e2d47cbb0b6d67950e2ad3512798395d65cf
Author: Wenchen Fan
Date: 2016-11-09T04:43:34Z

    warehouse path should be a static conf
[GitHub] spark pull request #15262: [SPARK-17690][STREAMING][SQL] Add mini-dfs cluste...
Github user ScrapCodes closed the pull request at: https://github.com/apache/spark/pull/15262
[GitHub] spark issue #15262: [SPARK-17690][STREAMING][SQL] Add mini-dfs cluster based...
Github user ScrapCodes commented on the issue: https://github.com/apache/spark/pull/15262 I was going to close this for now. @srowen Those deps should not have changed; I have not added anything to the compile scope. I have not analyzed how those deps are generated. Do you have an understanding of why that would happen?
[GitHub] spark issue #15824: [SQL] Skip subexpression elimination for conditional exp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15824 **[Test build #68389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68389/consoleFull)** for PR 15824 at commit [`548e45f`](https://github.com/apache/spark/commit/548e45f63f64a3f9e9709af39610ce50390b0fa1).
[GitHub] spark issue #15824: [SQL] Skip subexpression elimination for conditional exp...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15824 cc @cloud-fan @kiszk
[GitHub] spark pull request #15824: [SQL] Skip subexpression elimination for conditio...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/15824

[SQL] Skip subexpression elimination for conditional expressions

## What changes were proposed in this pull request?

As per the discussion at #15807, we should disallow subexpression elimination for expressions wrapped in conditional expressions such as `If`.

## How was this patch tested?

Jenkins tests.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 no-subexpr-eliminate-conditionexpr

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15824.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15824

commit 548e45f63f64a3f9e9709af39610ce50390b0fa1
Author: Liang-Chi Hsieh
Date: 2016-11-09T04:25:21Z

    Skip subexpression elimination for conditional expressions.
[GitHub] spark issue #15807: [SPARK-18147][SQL] do not fail for very complex aggregat...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/15807 @viirya @cloud-fan It looks reasonable to me to skip subexpression elimination for expressions wrapped in conditional expressions such as `If`, because we only have one place, at the top level, to put a common subexpression.
[GitHub] spark pull request #15563: [SPARK-16759][CORE] Add a configuration property ...
Github user weiqingy commented on a diff in the pull request: https://github.com/apache/spark/pull/15563#discussion_r87131269 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2587,17 +2589,16 @@ private[spark] class CallerContext( taskId: Option[Long] = None, taskAttemptNumber: Option[Int] = None) extends Logging { - val appIdStr = if (appId.isDefined) s"_${appId.get}" else "" - val appAttemptIdStr = if (appAttemptId.isDefined) s"_${appAttemptId.get}" else "" - val jobIdStr = if (jobId.isDefined) s"_JId_${jobId.get}" else "" - val stageIdStr = if (stageId.isDefined) s"_SId_${stageId.get}" else "" - val stageAttemptIdStr = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else "" - val taskIdStr = if (taskId.isDefined) s"_TId_${taskId.get}" else "" - val taskAttemptNumberStr = - if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else "" - - val context = "SPARK_" + from + appIdStr + appAttemptIdStr + - jobIdStr + stageIdStr + stageAttemptIdStr + taskIdStr + taskAttemptNumberStr + val context = "SPARK_" + --- End diff -- Done.
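The diff above replaces repeated `isDefined`/`get` checks with a single concatenation. The idiom it moves toward can be sketched as follows (names here are illustrative, not the exact Utils.scala code):

```scala
// Sketch of the Option-formatting idiom: each optional ID contributes a
// prefixed segment when present, or an empty string when absent.
def segment[T](prefix: String, value: Option[T]): String =
  value.map(v => s"$prefix$v").getOrElse("")

// Hypothetical reduced signature for illustration only.
def callerContext(from: String, appId: Option[String], jobId: Option[Long]): String =
  "SPARK_" + from + segment("_", appId) + segment("_JId_", jobId)
```

Using `map`/`getOrElse` keeps each optional field to one expression instead of a named intermediate `...Str` val per field.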
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15814 **[Test build #68388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68388/consoleFull)** for PR 15814 at commit [`91f87de`](https://github.com/apache/spark/commit/91f87de270bbd743f0b40b82016391706a936ca1).
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87109861 --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala --- @@ -86,6 +86,16 @@ abstract class FileCommitProtocol { def newTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String /** + * Similar to newTaskTempFile(), but allows files to committed to an absolute output location. + * Depending on the implementation, there may be weaker guarantees around adding files this way. + */ + def newTaskTempFileAbsPath( --- End diff -- Done
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87111922 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -42,17 +44,21 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) /** OutputCommitter from Hadoop is not serializable so marking it transient. */ @transient private var committer: OutputCommitter = _ + /** + * Tracks files staged by this task for absolute output paths. These outputs are not managed by + * the Hadoop OutputCommitter, so we must move these to their final locations on job commit. + */ + @transient private var addedAbsPathFiles: mutable.Map[String, String] = null + + private def absPathStagingDir: Path = new Path(path, "_temporary-" + jobId) --- End diff -- Done
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87112095 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -350,13 +350,15 @@ case class BroadcastHint(child: LogicalPlan) extends UnaryNode { * Options for writing new data into a table. * * @param enabled whether to overwrite existing data in the table. - * @param specificPartition only data in the specified partition will be overwritten. + * @param staticPartitionKeys if non-empty, specifies that we only want to overwrite partitions --- End diff -- Yep it is in the next sentence.
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87113460 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala --- @@ -182,41 +182,53 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] { "Cannot overwrite a path that is also being read from.") } - val overwritingSinglePartition = -overwrite.specificPartition.isDefined && + val partitionSchema = query.resolve( +t.partitionSchema, t.sparkSession.sessionState.analyzer.resolver) + val partitionsTrackedByCatalog = t.sparkSession.sessionState.conf.manageFilesourcePartitions && +l.catalogTable.isDefined && l.catalogTable.get.partitionColumnNames.nonEmpty && l.catalogTable.get.tracksPartitionsInCatalog - val effectiveOutputPath = if (overwritingSinglePartition) { -val partition = t.sparkSession.sessionState.catalog.getPartition( - l.catalogTable.get.identifier, overwrite.specificPartition.get) -new Path(partition.storage.locationUri.get) - } else { -outputPath - } - - val effectivePartitionSchema = if (overwritingSinglePartition) { -Nil - } else { -query.resolve(t.partitionSchema, t.sparkSession.sessionState.analyzer.resolver) + var initialMatchingPartitions: Seq[TablePartitionSpec] = Nil + var customPartitionLocations: Map[TablePartitionSpec, String] = Map.empty + + // When partitions are tracked by the catalog, compute all custom partition locations that + // may be relevant to the insertion job. + if (partitionsTrackedByCatalog) { +val matchingPartitions = t.sparkSession.sessionState.catalog.listPartitions( --- End diff -- You also need the set of matching partitions (including those with default locations) in order to determine which ones to delete at the end of an overwrite call. This makes the optimization quite messy, so I'd rather not push it to the catalog for now.
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87129853 --- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala --- @@ -86,6 +86,16 @@ abstract class FileCommitProtocol { def newTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String], ext: String): String /** + * Similar to newTaskTempFile(), but allows files to committed to an absolute output location. + * Depending on the implementation, there may be weaker guarantees around adding files this way. + */ + def newTaskTempFileAbsPath( --- End diff -- I thought about combining it but I think the method semantics become too subtle then.
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87112188 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -418,6 +418,8 @@ case class DataSource( val plan = InsertIntoHadoopFsRelationCommand( outputPath, +Map.empty, --- End diff -- Done
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87111758 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -42,17 +44,21 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) /** OutputCommitter from Hadoop is not serializable so marking it transient. */ @transient private var committer: OutputCommitter = _ + /** + * Tracks files staged by this task for absolute output paths. These outputs are not managed by + * the Hadoop OutputCommitter, so we must move these to their final locations on job commit. + */ + @transient private var addedAbsPathFiles: mutable.Map[String, String] = null --- End diff -- They are files. We need to track the unique output location of each file here in order to know where to place it. We could use directories, but they would end up with one file each anyways.
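The per-file tracking ericl describes can be sketched in isolation (this is a minimal stand-in, not the actual `HadoopMapReduceCommitProtocol`; names and paths are illustrative):

```scala
import scala.collection.mutable

// Minimal sketch of the staging-map idea: absolute-path outputs are written
// to a staging directory first, and the map records, per file, where it must
// be moved when the job commits.
class AbsPathStagingSketch(stagingDir: String) {
  // staged path -> final absolute path
  private val addedAbsPathFiles = mutable.Map[String, String]()

  def newTaskTempFileAbsPath(finalDir: String, fileName: String): String = {
    val staged = s"$stagingDir/$fileName"
    addedAbsPathFiles(staged) = s"$finalDir/$fileName"
    staged // the task writes here; commit moves it to the recorded location
  }

  // On job commit, every staged file is renamed to its recorded destination.
  def filesToMove: Map[String, String] = addedAbsPathFiles.toMap
}
```

Because each output file can have a distinct final location, a per-file map is the natural granularity, which matches the reviewer's point that tracking directories would still end up one file per entry.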
[GitHub] spark pull request #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15814#discussion_r87112037 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,18 +178,13 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging { "partitions with value: " + dynamicPartitionKeys.keys.mkString("[", ",", "]"), ctx) } val overwrite = ctx.OVERWRITE != null -val overwritePartition = - if (overwrite && partitionKeys.nonEmpty && dynamicPartitionKeys.isEmpty) { -Some(partitionKeys.map(t => (t._1, t._2.get))) - } else { -None - } +val staticPartitionKeys = partitionKeys.filter(_._2.nonEmpty).map(t => (t._1, t._2.get)) --- End diff -- Done
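The one-liner in this diff keeps only the partition keys whose value was given explicitly (static keys) and unwraps the `Option`. A standalone sketch with made-up example data:

```scala
// Parsed PARTITION clause: a static key carries Some(value), a dynamic key
// carries None. Example data is illustrative, not from the PR.
val partitionKeys: Map[String, Option[String]] =
  Map("ds" -> Some("2016-11-09"), "hr" -> None) // hr is dynamic

// Keep static keys only, unwrapping the Option (safe after the filter).
val staticPartitionKeys: Map[String, String] =
  partitionKeys.filter(_._2.nonEmpty).map(t => (t._1, t._2.get))
```

Unlike the removed `overwritePartition` logic, this no longer requires *all* keys to be static, which is what lets mixed static/dynamic INSERT OVERWRITE work.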
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15814 **[Test build #68387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68387/consoleFull)** for PR 15814 at commit [`4296612`](https://github.com/apache/spark/commit/4296612bd373bc1118b2cbfc10530b1895aaef60).
[GitHub] spark issue #15233: [SPARK-17659] [SQL] Partitioned View is Not Supported By...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15233 **[Test build #68386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68386/consoleFull)** for PR 15233 at commit [`c6d3acd`](https://github.com/apache/spark/commit/c6d3acd953ad70f6d66252d13e6cf4c0e507d33c).
[GitHub] spark issue #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work w...
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/15820 Wow, looks like the new github comment interface did all kinds of weird things, apologies about that.
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87129126 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala --- @@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private( record } + @tailrec + final def getAndIgnoreLostData( + offset: Long, + untilOffset: Long, + pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = { +// scalastyle:off +// When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive): +// 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1` +// Seek to the beginningOffset and fetch the data. +// 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. +// 3. The topic is deleted. +// There is nothing to fetch, return null. +// 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`. +// We cannot detect this case. We can still fetch data like nothing happens. +// 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`. +// Same as 4. +// 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`. +// There is nothing to fetch, return null. +// 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`. +// Same as 1. +// 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. 
+// scalastyle:on +if (offset >= untilOffset) { + // Case 2 or 8 + // We seek to beginningOffset but beginningOffset >= untilOffset + reset() + return null +} + +logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset") +var outOfOffset = false +if (offset != nextOffsetInFetchedData) { + logInfo(s"Initial fetch for $topicPartition $offset") + seek(offset) + try { +poll(pollTimeoutMs) + } catch { +case e: OffsetOutOfRangeException => + logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e) + outOfOffset = true + } +} else if (!fetchedData.hasNext()) { + // The last pre-fetched data has been drained. + seek(offset) + try { +poll(pollTimeoutMs) + } catch { +case e: OffsetOutOfRangeException => + logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e) + outOfOffset = true + } +} +if (outOfOffset) { + val beginningOffset = getBeginningOffset() + if (beginningOffset <= offset) { +val latestOffset = getLatestOffset() +if (latestOffset <= offset) { + // Case 3 or 6 + logWarning(s"Offset ${offset} is later than the latest offset $latestOffset. " + +s"Skipped [$offset, $untilOffset)") + reset() + return null +} else { + // Case 4 or 5 + getAndIgnoreLostData(offset, untilOffset, pollTimeoutMs) --- End diff -- - Why is this not an early return? - The arguments to the recursive function have not changed at this point, right? Why does it terminate?
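The eight numbered cases in the quoted comment block reduce to three outcomes, which can be captured in a small standalone classifier (a hedged sketch of the case analysis only, not the actual `CachedKafkaConsumer` code):

```scala
// Given the requested range [offset, untilOffset) and the partition's current
// [beginning, latest) range, decide what the consumer can do.
sealed trait Action
case object Fetch extends Action           // data available at `offset` (cases 4/5)
case object SeekToBeginning extends Action // some data aged out (cases 1/7)
case object NothingToFetch extends Action  // cases 2/3/6/8

def classify(offset: Long, untilOffset: Long, beginning: Long, latest: Long): Action =
  if (offset >= untilOffset || offset >= latest || beginning >= untilOffset) NothingToFetch
  else if (offset < beginning) SeekToBeginning
  else Fetch
```

Note the classifier cannot tell cases 4/5 (topic deleted and recreated, offsets overlapping) from normal operation, which is exactly the caveat the comment block states.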
[GitHub] spark pull request #15233: [SPARK-17659] [SQL] Partitioned View is Not Suppo...
GitHub user gatorsmile reopened a pull request: https://github.com/apache/spark/pull/15233 [SPARK-17659] [SQL] Partitioned View is Not Supported By SHOW CREATE TABLE ### What changes were proposed in this pull request? `Partitioned View` is not supported by SPARK SQL. For Hive partitioned view, SHOW CREATE TABLE is unable to generate the right DDL. Thus, SHOW CREATE TABLE should not support it like the other Hive-only features. This PR is to issue an exception when detecting the view is a partitioned view. ### How was this patch tested? Added a test case You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark partitionedView Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15233.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15233 commit c6d3acd953ad70f6d66252d13e6cf4c0e507d33c Author: gatorsmile Date: 2016-09-25T02:57:11Z fix.
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87129981 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala --- @@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private( record } + @tailrec + final def getAndIgnoreLostData( + offset: Long, + untilOffset: Long, + pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = { +// scalastyle:off +// When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive): +// 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1` +// Seek to the beginningOffset and fetch the data. +// 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. +// 3. The topic is deleted. +// There is nothing to fetch, return null. +// 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`. +// We cannot detect this case. We can still fetch data like nothing happens. +// 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`. +// Same as 4. +// 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`. +// There is nothing to fetch, return null. +// 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`. +// Same as 1. +// 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. 
+// scalastyle:on +if (offset >= untilOffset) { + // Case 2 or 8 + // We seek to beginningOffset but beginningOffset >= untilOffset + reset() + return null +} + +logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset") +var outOfOffset = false --- End diff -- Can this var be eliminated by just using a single try around the if / else? It's the same catch condition in either case
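The reviewer's suggestion, sketched standalone: since both branches share the same catch clause, wrapping the whole `if`/`else` in one `try` returns the out-of-range flag directly and the mutable `var` disappears. (`seek`, `poll`, and the exception type here are stand-ins for the real Kafka consumer calls.)

```scala
// One try around the if/else replaces `var outOfOffset = false`.
// `poll` is passed in so the sketch is testable; IndexOutOfBoundsException
// stands in for Kafka's OffsetOutOfRangeException.
def fetchWentOutOfRange(offset: Long, nextOffset: Long, hasFetchedData: Boolean)
                       (poll: Long => Unit): Boolean =
  try {
    if (offset != nextOffset) { /* seek(offset) */ poll(offset) }
    else if (!hasFetchedData) { /* seek(offset) */ poll(offset) }
    false // fetch succeeded (or nothing needed polling)
  } catch {
    case _: IndexOutOfBoundsException => true // recover from beginning offset
  }
```

The caller then branches on the returned Boolean exactly where the original code tested `outOfOffset`.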
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87129817 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala --- @@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private( record } + @tailrec + final def getAndIgnoreLostData( + offset: Long, + untilOffset: Long, + pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = { +// scalastyle:off +// When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive): +// 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1` +// Seek to the beginningOffset and fetch the data. +// 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. +// 3. The topic is deleted. +// There is nothing to fetch, return null. +// 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`. +// We cannot detect this case. We can still fetch data like nothing happens. +// 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`. +// Same as 4. +// 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`. +// There is nothing to fetch, return null. +// 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`. +// Same as 1. +// 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. 
+// scalastyle:on +if (offset >= untilOffset) { + // Case 2 or 8 + // We seek to beginningOffset but beginningOffset >= untilOffset + reset() + return null +} + +logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset") +var outOfOffset = false +if (offset != nextOffsetInFetchedData) { + logInfo(s"Initial fetch for $topicPartition $offset") + seek(offset) + try { +poll(pollTimeoutMs) + } catch { +case e: OffsetOutOfRangeException => + logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e) + outOfOffset = true + } +} else if (!fetchedData.hasNext()) { + // The last pre-fetched data has been drained. + seek(offset) --- End diff -- I don't think it's necessary to seek every time the fetched data is empty, in normal operation the poll should return the next offset, right?
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87129927 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala --- @@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private( record } + @tailrec + final def getAndIgnoreLostData( + offset: Long, + untilOffset: Long, + pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = { +// scalastyle:off +// When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive): +// 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1` +// Seek to the beginningOffset and fetch the data. +// 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. +// 3. The topic is deleted. +// There is nothing to fetch, return null. +// 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`. +// We cannot detect this case. We can still fetch data like nothing happens. +// 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`. +// Same as 4. +// 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`. +// There is nothing to fetch, return null. +// 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`. +// Same as 1. +// 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`. +// There is nothing to fetch, return null. 
+// scalastyle:on +if (offset >= untilOffset) { + // Case 2 or 8 + // We seek to beginningOffset but beginningOffset >= untilOffset + reset() + return null +} + +logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset") +var outOfOffset = false +if (offset != nextOffsetInFetchedData) { + logInfo(s"Initial fetch for $topicPartition $offset") + seek(offset) + try { +poll(pollTimeoutMs) + } catch { +case e: OffsetOutOfRangeException => + logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e) + outOfOffset = true + } +} else if (!fetchedData.hasNext()) { + // The last pre-fetched data has been drained. + seek(offset) + try { +poll(pollTimeoutMs) + } catch { +case e: OffsetOutOfRangeException => + logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e) + outOfOffset = true + } +} +if (outOfOffset) { + val beginningOffset = getBeginningOffset() + if (beginningOffset <= offset) { +val latestOffset = getLatestOffset() +if (latestOffset <= offset) { + // Case 3 or 6 + logWarning(s"Offset ${offset} is later than the latest offset $latestOffset. " + +s"Skipped [$offset, $untilOffset)") + reset() + return null +} else { + // Case 4 or 5 + getAndIgnoreLostData(offset, untilOffset, pollTimeoutMs) +} + } else { +// Case 1 or 7 +logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $beginningOffset})") +return getAndIgnoreLostData(beginningOffset, untilOffset, pollTimeoutMs) + } +} else { + if (!fetchedData.hasNext()) { +// We cannot fetch anything after `polling`. Two possible cases: +// - `beginningOffset` is `offset` but there is nothing for `beginningOffset` right now. +// - Cannot fetch any date before timeout. +// Because there is no way to distinguish, just skip the rest offsets in the current +// partition. 
+logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $untilOffset)") +reset() +return null + } + + val record = fetchedData.next() + if (record.offset >= untilOffset) { +// Case 2 +logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $untilOffset)") +reset() +return null + } else { +if (record.offset != offset) { + // Case 1 + logWarning(s"Buffer miss for $groupId $topicPartition [$offset,
[GitHub] spark issue #15233: [SPARK-17659] [SQL] Partitioned View is Not Supported By...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15233 retest this please
[GitHub] spark issue #15233: [SPARK-17659] [SQL] Partitioned View is Not Supported By...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15233

Sure, let me reopen it. : )
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87127811

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
@@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private(
     record
   }

+  @tailrec
+  final def getAndIgnoreLostData(
+      offset: Long,
+      untilOffset: Long,
+      pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = {
+    // scalastyle:off
+    // When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive):
+    // 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1`
+    //    Seek to the beginningOffset and fetch the data.
+    // 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // 3. The topic is deleted.
+    //    There is nothing to fetch, return null.
+    // 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`.
+    //    We cannot detect this case. We can still fetch data like nothing happens.
+    // 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`.
+    //    Same as 4.
+    // 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`.
+    //    There is nothing to fetch, return null.
+    // 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`.
+    //    Same as 1.
+    // 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // scalastyle:on
+    if (offset >= untilOffset) {
+      // Case 2 or 8
+      // We seek to beginningOffset but beginningOffset >= untilOffset
+      reset()
+      return null
+    }
+
+    logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset")
+    var outOfOffset = false
+    if (offset != nextOffsetInFetchedData) {
+      logInfo(s"Initial fetch for $topicPartition $offset")
+      seek(offset)
+      try {
+        poll(pollTimeoutMs)
+      } catch {
+        case e: OffsetOutOfRangeException =>
+          logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e)
+          outOfOffset = true
+      }
+    } else if (!fetchedData.hasNext()) {
+      // The last pre-fetched data has been drained.
+      seek(offset)
+      try {
+        poll(pollTimeoutMs)
+      } catch {
+        case e: OffsetOutOfRangeException =>
+          logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e)
+          outOfOffset = true
+      }
+    }
+    if (outOfOffset) {
+      val beginningOffset = getBeginningOffset()
+      if (beginningOffset <= offset) {
+        val latestOffset = getLatestOffset()
+        if (latestOffset <= offset) {
+          // Case 3 or 6
+          logWarning(s"Offset ${offset} is later than the latest offset $latestOffset. " +
+            s"Skipped [$offset, $untilOffset)")
+          reset()
+          return null
+        } else {
+          // Case 4 or 5
+          getAndIgnoreLostData(offset, untilOffset, pollTimeoutMs)
+        }
+      } else {
+        // Case 1 or 7
+        logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $beginningOffset})")
+        return getAndIgnoreLostData(beginningOffset, untilOffset, pollTimeoutMs)
+      }
+    } else {
+      if (!fetchedData.hasNext()) {
+        // We cannot fetch anything after polling. Two possible cases:
+        // - `beginningOffset` is `offset` but there is nothing for `beginningOffset` right now.
+        // - Cannot fetch any data before timeout.
+        // Because there is no way to distinguish, just skip the rest offsets in the current
+        // partition.
+        logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $untilOffset)")
+        reset()
+        return null
+      }
+
+      val record = fetchedData.next()
+      if (record.offset >= untilOffset) {
+        // Case 2
+        logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $untilOffset)")
+        reset()
+        return null
+      } else {
+        if (record.offset != offset) {
+          // Case 1
+          logWarning(s"Buffer miss for $groupId $topicPartition [$offset,
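The eight-case comment in the diff above boils down to comparing the requested range `[offset, untilOffset)` against the partition's current `[beginningOffset, latestOffset)`. A minimal, self-contained Scala sketch of that classification — the names and `Action` type are illustrative, not the PR's actual API:

```scala
// Given the requested range [offset, untilOffset) and the broker's current
// [beginningOffset, latestOffset), decide how the consumer should react.
object OffsetCases {
  sealed trait Action
  case object ReturnNull extends Action              // nothing left to fetch
  case class SeekTo(newOffset: Long) extends Action  // recover from the beginning offset
  case object FetchNormally extends Action           // requested data is (still) available

  def classify(offset: Long, untilOffset: Long,
               beginningOffset: Long, latestOffset: Long): Action = {
    if (offset >= untilOffset) {
      ReturnNull                                     // cases 2, 8: range already exhausted
    } else if (offset < beginningOffset) {
      if (beginningOffset < untilOffset) SeekTo(beginningOffset) // cases 1, 7: data aged out, partial recovery
      else ReturnNull                                // cases 2, 8: everything requested is gone
    } else if (offset >= latestOffset) {
      ReturnNull                                     // cases 3, 6: nothing at or after offset exists
    } else {
      FetchNormally                                  // cases 4, 5: indistinguishable from normal operation
    }
  }
}
```

Cases 4 and 5 land in `FetchNormally` precisely because, as the diff's comment notes, a delete-and-recreate that keeps `offset` inside the live range cannot be detected from offsets alone.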
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87130059

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
@@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private(
     record
   }

+  @tailrec
+  final def getAndIgnoreLostData(
+      offset: Long,
+      untilOffset: Long,
+      pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = {
+    // scalastyle:off
+    // When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive):
+    // 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1`
+    //    Seek to the beginningOffset and fetch the data.
+    // 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // 3. The topic is deleted.
+    //    There is nothing to fetch, return null.
+    // 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`.
+    //    We cannot detect this case. We can still fetch data like nothing happens.
+    // 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`.
+    //    Same as 4.
+    // 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`.
+    //    There is nothing to fetch, return null.
+    // 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`.
+    //    Same as 1.
+    // 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // scalastyle:on
+    if (offset >= untilOffset) {
+      // Case 2 or 8
+      // We seek to beginningOffset but beginningOffset >= untilOffset
+      reset()
+      return null
+    }
+
+    logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset")
+    var outOfOffset = false
+    if (offset != nextOffsetInFetchedData) {
+      logInfo(s"Initial fetch for $topicPartition $offset")
+      seek(offset)
+      try {
+        poll(pollTimeoutMs)
+      } catch {
+        case e: OffsetOutOfRangeException =>
+          logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e)
+          outOfOffset = true
+      }
+    } else if (!fetchedData.hasNext()) {
+      // The last pre-fetched data has been drained.
+      seek(offset)
+      try {
+        poll(pollTimeoutMs)
+      } catch {
+        case e: OffsetOutOfRangeException =>
+          logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e)
+          outOfOffset = true
+      }
+    }
+    if (outOfOffset) {
+      val beginningOffset = getBeginningOffset()
+      if (beginningOffset <= offset) {
+        val latestOffset = getLatestOffset()
+        if (latestOffset <= offset) {
+          // Case 3 or 6
+          logWarning(s"Offset ${offset} is later than the latest offset $latestOffset. " +
+            s"Skipped [$offset, $untilOffset)")
+          reset()
+          return null
+        } else {
+          // Case 4 or 5
+          getAndIgnoreLostData(offset, untilOffset, pollTimeoutMs)

--- End diff --

- Why isn't this an early return?
- Unless I'm misreading, this is a recursive call without changing the arguments. Why is it guaranteed to terminate?
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87128373

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
@@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private(
     record
   }

+  @tailrec
+  final def getAndIgnoreLostData(
+      offset: Long,
+      untilOffset: Long,
+      pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = {
+    // scalastyle:off
+    // When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive):
+    // 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1`
+    //    Seek to the beginningOffset and fetch the data.
+    // 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // 3. The topic is deleted.
+    //    There is nothing to fetch, return null.
+    // 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`.
+    //    We cannot detect this case. We can still fetch data like nothing happens.
+    // 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`.
+    //    Same as 4.
+    // 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`.
+    //    There is nothing to fetch, return null.
+    // 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`.
+    //    Same as 1.
+    // 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // scalastyle:on
+    if (offset >= untilOffset) {
+      // Case 2 or 8
+      // We seek to beginningOffset but beginningOffset >= untilOffset
+      reset()
+      return null
+    }
+
+    logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset")
+    var outOfOffset = false
+    if (offset != nextOffsetInFetchedData) {
+      logInfo(s"Initial fetch for $topicPartition $offset")
+      seek(offset)
+      try {
+        poll(pollTimeoutMs)
+      } catch {
+        case e: OffsetOutOfRangeException =>
+          logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e)
+          outOfOffset = true
+      }
+    } else if (!fetchedData.hasNext()) {
+      // The last pre-fetched data has been drained.
+      seek(offset)
+      try {
+        poll(pollTimeoutMs)
+      } catch {
+        case e: OffsetOutOfRangeException =>
+          logWarning(s"Cannot fetch offset $offset, try to recover from the beginning offset", e)
+          outOfOffset = true
+      }
+    }
+    if (outOfOffset) {
+      val beginningOffset = getBeginningOffset()
+      if (beginningOffset <= offset) {
+        val latestOffset = getLatestOffset()
+        if (latestOffset <= offset) {
+          // Case 3 or 6
+          logWarning(s"Offset ${offset} is later than the latest offset $latestOffset. " +
+            s"Skipped [$offset, $untilOffset)")
+          reset()
+          return null
+        } else {
+          // Case 4 or 5
+          getAndIgnoreLostData(offset, untilOffset, pollTimeoutMs)
+        }
+      } else {
+        // Case 1 or 7
+        logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $beginningOffset})")
+        return getAndIgnoreLostData(beginningOffset, untilOffset, pollTimeoutMs)
+      }
+    } else {
+      if (!fetchedData.hasNext()) {
+        // We cannot fetch anything after polling. Two possible cases:
+        // - `beginningOffset` is `offset` but there is nothing for `beginningOffset` right now.
+        // - Cannot fetch any data before timeout.
+        // Because there is no way to distinguish, just skip the rest offsets in the current
+        // partition.
+        logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $untilOffset)")
+        reset()
+        return null
+      }
+
+      val record = fetchedData.next()
+      if (record.offset >= untilOffset) {
+        // Case 2
+        logWarning(s"Buffer miss for $groupId $topicPartition [$offset, $untilOffset)")
+        reset()
+        return null
+      } else {
+        if (record.offset != offset) {
+          // Case 1
+          logWarning(s"Buffer miss for $groupId $topicPartition [$offset,
[GitHub] spark pull request #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/15820#discussion_r87129204

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala ---
@@ -83,6 +86,113 @@ private[kafka010] case class CachedKafkaConsumer private(
     record
   }

+  @tailrec
+  final def getAndIgnoreLostData(
+      offset: Long,
+      untilOffset: Long,
+      pollTimeoutMs: Long): ConsumerRecord[Array[Byte], Array[Byte]] = {
+    // scalastyle:off
+    // When `failOnDataLoss` is `false`, we need to handle the following cases (note: untilOffset and latestOffset are exclusive):
+    // 1. Some data are aged out, and `offset < beginningOffset <= untilOffset - 1 <= latestOffset - 1`
+    //    Seek to the beginningOffset and fetch the data.
+    // 2. Some data are aged out, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // 3. The topic is deleted.
+    //    There is nothing to fetch, return null.
+    // 4. The topic is deleted and recreated, and `beginningOffset <= offset <= untilOffset - 1 <= latestOffset - 1`.
+    //    We cannot detect this case. We can still fetch data like nothing happens.
+    // 5. The topic is deleted and recreated, and `beginningOffset <= offset < latestOffset - 1 < untilOffset - 1`.
+    //    Same as 4.
+    // 6. The topic is deleted and recreated, and `beginningOffset <= latestOffset - 1 < offset <= untilOffset - 1`.
+    //    There is nothing to fetch, return null.
+    // 7. The topic is deleted and recreated, and `offset < beginningOffset <= untilOffset - 1`.
+    //    Same as 1.
+    // 8. The topic is deleted and recreated, and `offset <= untilOffset - 1 < beginningOffset`.
+    //    There is nothing to fetch, return null.
+    // scalastyle:on
+    if (offset >= untilOffset) {
+      // Case 2 or 8
+      // We seek to beginningOffset but beginningOffset >= untilOffset
+      reset()
+      return null
+    }
+
+    logDebug(s"Get $groupId $topicPartition nextOffset $nextOffsetInFetchedData requested $offset")
+    var outOfOffset = false

--- End diff --

Can't this var be eliminated with a single try around the following if/else?
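One way to realize the suggestion above is to wrap the seek-and-poll in a single `scala.util.Try` and turn the exceptional case into a boolean result, removing the mutable flag entirely. A self-contained sketch under stated assumptions: `seekAndPoll` and the exception class here are stand-ins for the consumer's real `seek`/`poll` and Kafka's `OffsetOutOfRangeException`, so the shapes are illustrative only:

```scala
import scala.util.{Failure, Success, Try}

// Stand-in for org.apache.kafka.clients.consumer.OffsetOutOfRangeException,
// defined locally so the sketch compiles without the Kafka client on the classpath.
class OffsetOutOfRangeException(msg: String) extends RuntimeException(msg)

// Stand-in for seek(offset); poll(pollTimeoutMs): throws when the requested
// offset has already aged out of the partition.
def seekAndPoll(offset: Long, beginningOffset: Long): Unit =
  if (offset < beginningOffset) throw new OffsetOutOfRangeException(s"offset $offset aged out")

// Single Try replaces the duplicated try/catch blocks and the `outOfOffset` var:
// the boolean is computed once instead of mutated from two places.
def outOfRange(offset: Long, beginningOffset: Long): Boolean =
  Try(seekAndPoll(offset, beginningOffset)) match {
    case Success(_)                             => false
    case Failure(_: OffsetOutOfRangeException)  => true
    case Failure(other)                         => throw other  // don't swallow unrelated errors
  }
```

The caller then branches on `outOfRange(...)` directly, which also removes the duplicated catch block in the original `if`/`else if`.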
[GitHub] spark issue #15814: [SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15814

**[Test build #68385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68385/consoleFull)** for PR 15814 at commit [`fbd7b42`](https://github.com/apache/spark/commit/fbd7b425290644fcfdcaf92491aaae2ac67eefb9).