[GitHub] spark issue #19141: [SPARK-21384] [YARN] Spark + YARN fails with LocalFileSy...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19141 I see, thanks for the explanation.
[GitHub] spark issue #18981: Fixed pandoc dependency issue in python/setup.py
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18981 Merged to master and branch-2.2
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 Is it not OK to just follow Spark's current behavior? (It will be different from Hive.) I made this PR because we are migrating from Hive to Spark and lots of our users use this function.
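For context, a minimal sketch of the Hive behavior under discussion in SPARK-21874 (the database and table names here are illustrative, not taken from the PR):

```scala
// In Hive, ALTER TABLE ... RENAME TO ... may move a table into another
// database; this PR proposes supporting the same form in Spark SQL.
spark.sql("CREATE TABLE db1.src(id INT) USING parquet")
spark.sql("ALTER TABLE db1.src RENAME TO db2.dst")
```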
[GitHub] spark issue #19151: [SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19151 **[Test build #81489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81489/testReport)** for PR 19151 at commit [`f05f281`](https://github.com/apache/spark/commit/f05f281eb5fda2b68e7e5f7a1a61a87a7a4bc467).
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18982 Hmmm, I can reproduce the error with Python 3; I'll look into it tomorrow.
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18982 No problem @holdenk, I updated the test to use `transform()`. See if it looks ok to you now (pending Jenkins). Thanks!
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18982 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81487/ Test FAILed.
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18982 Merged build finished. Test FAILed.
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18982 **[Test build #81487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81487/testReport)** for PR 18982 at commit [`482c025`](https://github.com/apache/spark/commit/482c02507e38909e934a9f2b7ea06612eaea5ce0).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17096 **[Test build #81488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81488/testReport)** for PR 17096 at commit [`830b4fe`](https://github.com/apache/spark/commit/830b4fe1f71befb97debd9286306b3f872eb1c09).
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17096 Jenkins retest this please.
[GitHub] spark issue #19151: [SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19151 **[Test build #81486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81486/testReport)** for PR 19151 at commit [`4fc4d05`](https://github.com/apache/spark/commit/4fc4d05fd8dfa5397f790051196893d2b6fb2ca5).
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18982 **[Test build #81487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81487/testReport)** for PR 18982 at commit [`482c025`](https://github.com/apache/spark/commit/482c02507e38909e934a9f2b7ea06612eaea5ce0).
[GitHub] spark issue #19151: [SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19151 cc @gatorsmile for review. Thanks.
[GitHub] spark pull request #19151: [SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSub...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19151

[SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery should not produce unresolved query plans

## What changes were proposed in this pull request?

This is a follow-up of #19050 to deal with the `ExistenceJoin` case.

## How was this patch tested?

Added test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-21835-followup

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19151.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19151

commit 4fc4d05fd8dfa5397f790051196893d2b6fb2ca5
Author: Liang-Chi Hsieh
Date:   2017-09-07T00:04:07Z

    Deal with ExistenceJoin case.
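A rough sketch of the property this follow-up guards (the rule name and the `ExistenceJoin` trigger come from the PR description, but the check below is illustrative, not the test actually added; it assumes a SparkSession `spark` with tables `t1` and `t2`):

```scala
import org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery

// An EXISTS predicate inside a disjunction is rewritten into an ExistenceJoin;
// after the rewrite, the plan must still be resolved.
val analyzed = spark.sql(
  "SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.a = t2.a) OR t1.b > 0"
).queryExecution.analyzed
assert(RewritePredicateSubquery(analyzed).resolved)
```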
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18982 Jenkins retest this please
[GitHub] spark issue #19150: [SPARK-21939][TEST] Use TimeLimits instead of Timeouts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19150 **[Test build #81485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81485/testReport)** for PR 19150 at commit [`ab339b3`](https://github.com/apache/spark/commit/ab339b31b311035ebb75e8f079000d306cab16b8).
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18982 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81484/ Test FAILed.
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18982 **[Test build #81484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81484/testReport)** for PR 18982 at commit [`482c025`](https://github.com/apache/spark/commit/482c02507e38909e934a9f2b7ea06612eaea5ce0).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18982 Merged build finished. Test FAILed.
[GitHub] spark pull request #19150: [SPARK-21939][TEST] Use TimeLimits instead of Tim...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19150

[SPARK-21939][TEST] Use TimeLimits instead of Timeouts

## What changes were proposed in this pull request?

Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated. This PR replaces the deprecated one with `org.scalatest.concurrent.TimeLimits`.

```scala
-import org.scalatest.concurrent.Timeouts._
+import org.scalatest.concurrent.TimeLimits._
```

## How was this patch tested?

Pass the existing test suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-21939

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19150.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19150

commit ab339b31b311035ebb75e8f079000d306cab16b8
Author: Dongjoon Hyun
Date:   2017-09-06T23:22:11Z

    [SPARK-21939][TEST] Use TimeLimits instead of Timeouts
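For reference, a minimal sketch of the replacement API, assuming ScalaTest 3.0 (the suite name is illustrative): `TimeLimits` provides the same `failAfter` as the deprecated `Timeouts`, but additionally requires an implicit `Signaler`.

```scala
import org.scalatest.FunSuite
import org.scalatest.concurrent.{Signaler, ThreadSignaler, TimeLimits}
import org.scalatest.time.SpanSugar._

class TimeLimitedSuite extends FunSuite with TimeLimits {
  // TimeLimits needs an implicit Signaler to interrupt code that runs too long.
  implicit val signaler: Signaler = ThreadSignaler

  test("completes within the limit") {
    failAfter(10.seconds) {
      // code under a time limit goes here
    }
  }
}
```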
[GitHub] spark issue #18982: [SPARK-21685][PYTHON][ML] PySpark Params isSet state sho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18982 **[Test build #81484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81484/testReport)** for PR 18982 at commit [`482c025`](https://github.com/apache/spark/commit/482c02507e38909e934a9f2b7ea06612eaea5ce0).
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19140 Merged build finished. Test PASSed.
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19140 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81477/ Test PASSed.
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19140 **[Test build #81477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81477/testReport)** for PR 19140 at commit [`98f0ff2`](https://github.com/apache/spark/commit/98f0ff2a655c398e5b502ce2b340dfac88b385e9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19136: [DO NOT MERGE][SPARK-15689][SQL] data source v2
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19136 Thanks for pinging me. I left comments on the older PR, since other discussion was already there. If you'd prefer comments here, just let me know.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137416501

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -201,13 +201,14 @@ case class AlterTableAddColumnsCommand(
     // make sure any partition columns are at the end of the fields
     val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
+    val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)
     SchemaUtils.checkColumnNameDuplication(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
+    DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema))
--- End diff --

I think this PR passes the above two test cases, too.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137416371

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -201,13 +201,14 @@ case class AlterTableAddColumnsCommand(
     // make sure any partition columns are at the end of the fields
     val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
+    val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)
     SchemaUtils.checkColumnNameDuplication(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
+    DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema))
--- End diff --

DDLSuite and HiveDDLSuite have them here.
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L2045-L2070
- https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala#L1736
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule conflict between ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19149 **[Test build #81483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81483/testReport)** for PR 19149 at commit [`e5501e1`](https://github.com/apache/spark/commit/e5501e1f46317a82b915d952f4ee192e5eb8e61d).
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137415411

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -201,13 +201,14 @@ case class AlterTableAddColumnsCommand(
     // make sure any partition columns are at the end of the fields
     val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
+    val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)
     SchemaUtils.checkColumnNameDuplication(
       reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
       conf.caseSensitiveAnalysis)
+    DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema))
--- End diff --

Could you add test cases and ensure the partitioning columns with special characters work?
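A hedged sketch of the kind of test being asked for, assuming DDLSuite's helpers (`test`, `withTable`, `sql`); the table and column names are made up. Parquet rejects characters such as spaces in data-schema field names, but a partition column is not stored in the data files, so the new check should let it through:

```scala
test("SPARK-21912: ADD COLUMNS with a special-character partition column") {
  withTable("t") {
    sql("CREATE TABLE t(a INT, `p p` STRING) USING parquet PARTITIONED BY (`p p`)")
    // `p p` lives only in the partition schema, so checkDataSchemaFieldNames
    // should not reject it; adding a well-formed column must still succeed.
    sql("ALTER TABLE t ADD COLUMNS (b INT)")
    assert(spark.table("t").schema.fieldNames.contains("b"))
  }
}
```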
[GitHub] spark pull request #19148: [SPARK-21936][SQL][WIP] backward compatibility te...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19148#discussion_r137413347

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala ---
@@ -0,0 +1,199 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.sql.Timestamp
+import java.util.Date
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.scalatest.concurrent.Timeouts
+import org.scalatest.exceptions.TestFailedDueToTimeoutException
+import org.scalatest.time.SpanSugar._
+
+import org.apache.spark.{SparkFunSuite, TestUtils}
+import org.apache.spark.sql.{QueryTest, Row, SparkSession}
+import org.apache.spark.sql.test.ProcessTestUtils.ProcessOutputCapturer
+import org.apache.spark.util.Utils
+
+/**
+ * Test HiveExternalCatalog backward compatibility.
+ *
+ * Note that, this test suite will automatically download spark binary packages of different
+ * versions to a local directory `/tmp/spark-test`. If there is already a spark folder with
+ * expected version under this local directory, e.g. `/tmp/spark-test/spark-2.0.3`, we will skip the
+ * downloading for this spark version.
+ */
+class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts {
+  private val wareHousePath = Utils.createTempDir(namePrefix = "warehouse")
--- End diff --

This single warehouse seems to cause a failure. Maybe Spark 2.0.2 tries to read a metastore created by 2.2.0?

```
build/sbt -Phive "project hive" "test-only *.HiveExternalCatalogVersionsSuite"
...
[info] - backward compatibility *** FAILED *** (17 seconds, 712 milliseconds)
[info]   spark-submit returned with exit code 1.
...
[info]   2017-09-06 16:07:41.744 - stderr> Caused by: java.sql.SQLException: Database at /Users/dongjoon/PR-19148/target/tmp/warehouse-d2818ad2-f141-4fc7-bc68-e7f67c89f3f4/metastore_db has an incompatible format with the current version of the software. The database was created by or upgraded by version 10.12.
```
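A minimal sketch of one way to avoid the collision that log points at, assuming the failure comes from every tested version sharing one warehouse directory (and therefore one Derby `metastore_db`). This fragment would live inside the suite; the method name is made up:

```scala
import java.io.File

import org.apache.spark.util.Utils

// One warehouse per tested Spark version, so a 2.0.x client never opens a
// Derby metastore_db that a newer Derby already upgraded in place.
private def warehousePathFor(version: String): File =
  Utils.createTempDir(namePrefix = s"warehouse-$version")
```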
[GitHub] spark pull request #19148: [SPARK-21936][SQL][WIP] backward compatibility te...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19148#discussion_r137412267

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala ---
@@ -0,0 +1,199 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.sql.Timestamp
+import java.util.Date
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.scalatest.concurrent.Timeouts
+import org.scalatest.exceptions.TestFailedDueToTimeoutException
+import org.scalatest.time.SpanSugar._
+
+import org.apache.spark.{SparkFunSuite, TestUtils}
+import org.apache.spark.sql.{QueryTest, Row, SparkSession}
+import org.apache.spark.sql.test.ProcessTestUtils.ProcessOutputCapturer
+import org.apache.spark.util.Utils
+
+/**
+ * Test HiveExternalCatalog backward compatibility.
+ *
+ * Note that, this test suite will automatically download spark binary packages of different
+ * versions to a local directory `/tmp/spark-test`. If there is already a spark folder with
+ * expected version under this local directory, e.g. `/tmp/spark-test/spark-2.0.3`, we will skip the
+ * downloading for this spark version.
+ */
+class HiveExternalCatalogVersionsSuite extends SparkFunSuite with Timeouts {
+  private val wareHousePath = Utils.createTempDir(namePrefix = "warehouse")
+  private val sparkTestingDir = "/tmp/spark-test"
+  private val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
+
+  override def afterAll(): Unit = {
+    Utils.deleteRecursively(wareHousePath)
+    super.afterAll()
+  }
+
+  // NOTE: This is an expensive operation in terms of time (10 seconds+). Use sparingly.
+  // This is copied from org.apache.spark.deploy.SparkSubmitSuite
+  private def runSparkSubmit(args: Seq[String], sparkHomeOpt: Option[String] = None): Unit = {
+    val sparkHome = sparkHomeOpt.getOrElse(
+      sys.props.getOrElse("spark.test.home", fail("spark.test.home is not set!")))
+    val history = ArrayBuffer.empty[String]
+    val sparkSubmit = if (Utils.isWindows) {
+      // On Windows, `ProcessBuilder.directory` does not change the current working directory.
+      new File("..\\..\\bin\\spark-submit.cmd").getAbsolutePath
+    } else {
+      "./bin/spark-submit"
+    }
+    val commands = Seq(sparkSubmit) ++ args
+    val commandLine = commands.mkString("'", "' '", "'")
+
+    val builder = new ProcessBuilder(commands: _*).directory(new File(sparkHome))
+    val env = builder.environment()
+    env.put("SPARK_TESTING", "1")
+    env.put("SPARK_HOME", sparkHome)
+
+    def captureOutput(source: String)(line: String): Unit = {
+      // This test suite has some weird behaviors when executed on Jenkins:
+      //
+      // 1. Sometimes it gets extremely slow out of unknown reason on Jenkins. Here we add a
+      //    timestamp to provide more diagnosis information.
+      // 2. Log lines are not correctly redirected to unit-tests.log as expected, so here we print
+      //    them out for debugging purposes.
+      val logLine = s"${new Timestamp(new Date().getTime)} - $source> $line"
+      // scalastyle:off println
+      println(logLine)
+      // scalastyle:on println
+      history += logLine
+    }
+
+    val process = builder.start()
+    new ProcessOutputCapturer(process.getInputStream, captureOutput("stdout")).start()
+    new ProcessOutputCapturer(process.getErrorStream, captureOutput("stderr")).start()
+
+    try {
+      val exitCode = failAfter(300.seconds) { process.waitFor() }
+      if (exitCode != 0) {
+        // include logs in output. Note that logging is async and may not have completed
+        // at the time this exception is raised
+        Thread.sleep(1000)
+        val historyLog = history.mkString("\n")
+        fail {
[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectori...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/18659#discussion_r137412019

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala ---
@@ -0,0 +1,127 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.execution.python
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{SparkEnv, TaskContext}
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType, PythonRunner}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.arrow.{ArrowConverters, ArrowPayload}
+import org.apache.spark.sql.types.{DataType, StructField, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A physical plan that evaluates a [[PythonUDF]],
+ */
+case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
+  extends SparkPlan {
+
+  def children: Seq[SparkPlan] = child :: Nil
+
+  override def producedAttributes: AttributeSet = AttributeSet(output.drop(child.output.length))
+
+  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
+    udf.children match {
+      case Seq(u: PythonUDF) =>
+        val (chained, children) = collectFunctions(u)
+        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
+      case children =>
+        // There should not be any other UDFs, or the children can't be evaluated directly.
+        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
+        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
+    }
+  }
+
+  protected override def doExecute(): RDD[InternalRow] = {
+    val inputRDD = child.execute().map(_.copy())
+    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
+    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
+
+    inputRDD.mapPartitions { iter =>
+
+      // The queue used to buffer input rows so we can drain it to
+      // combine input with output from Python.
+      val queue = HybridRowQueue(TaskContext.get().taskMemoryManager(),
+        new File(Utils.getLocalDir(SparkEnv.get.conf)), child.output.length)
+      TaskContext.get().addTaskCompletionListener({ ctx =>
+        queue.close()
+      })
+
+      val (pyFuncs, inputs) = udfs.map(collectFunctions).unzip
+
+      // flatten all the arguments
+      val allInputs = new ArrayBuffer[Expression]
+      val dataTypes = new ArrayBuffer[DataType]
+      val argOffsets = inputs.map { input =>
+        input.map { e =>
+          if (allInputs.exists(_.semanticEquals(e))) {
+            allInputs.indexWhere(_.semanticEquals(e))
+          } else {
+            allInputs += e
+            dataTypes += e.dataType
+            allInputs.length - 1
+          }
+        }.toArray
+      }.toArray
+      val projection = newMutableProjection(allInputs, child.output)
+      val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
+        StructField(s"_$i", dt)
+      })
+
+      // Input iterator to Python: input rows are grouped so we send them in batches to Python.
+      // For each row, add it to the queue.
+      val projectedRowIter = iter.map { inputRow =>
+        queue.add(inputRow.asInstanceOf[UnsafeRow])
+        projection(inputRow)
+      }
+
+      val context = TaskContext.get()
+
+      val inputIterator = ArrowConverters.toPayloadIterator(
+        projectedRowIter, schema, conf.arrowMaxRecordsPerBatch, context).
+        map(_.asPythonSerializable)
+
+      val schemaOut = St
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18659 Cool!
[GitHub] spark pull request #19112: [SPARK-21901][SS] Define toString for StateOperat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19112
[GitHub] spark issue #19112: [SPARK-21901][SS] Define toString for StateOperatorProgr...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19112 LGTM. Thanks! Merging to master and 2.2.
[GitHub] spark pull request #19096: [SPARK-21869][SS] A cached Kafka producer should ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/19096#discussion_r137409030

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala ---
@@ -43,8 +43,10 @@ private[kafka010] class KafkaWriteTask(
    * Writes key value data out to topics.
    */
   def execute(iterator: Iterator[InternalRow]): Unit = {
-    producer = CachedKafkaProducer.getOrCreate(producerConfiguration)
+    val paramsSeq = CachedKafkaProducer.paramsToSeq(producerConfiguration)
     while (iterator.hasNext && failedWrite == null) {
+      // Prevent producer to get expired/evicted from guava cache. (SPARK-21869)
+      producer = CachedKafkaProducer.getOrCreate(paramsSeq)
--- End diff --

This is really hacky. I'm wondering if we can track whether a producer is in use or not.
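One way to make the "track whether a producer is in use" idea concrete is a reference-counting wrapper. This is a purely illustrative sketch; none of these names exist in the PR:

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.kafka.clients.producer.KafkaProducer

// Each write task acquires the producer before use and releases it when done;
// the cache's removal listener would then close the producer only once idle.
class RefCountedProducer[K, V](val producer: KafkaProducer[K, V]) {
  private val refs = new AtomicInteger(0)

  def acquire(): KafkaProducer[K, V] = { refs.incrementAndGet(); producer }
  def release(): Unit = refs.decrementAndGet()
  def inUse: Boolean = refs.get() > 0
}
```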
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18975 **[Test build #81482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81482/testReport)** for PR 18975 at commit [`6c24b1b`](https://github.com/apache/spark/commit/6c24b1be90fdf0e65c80ae24f81c75d34f7e1542).
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81473/ Test PASSed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Merged build finished. Test PASSed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #81473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81473/testReport)** for PR 12646 at commit [`859510e`](https://github.com/apache/spark/commit/859510e6d796d138b08b10507f25535a077beffb).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19046: [SPARK-18769][yarn] Limit resource requests based on RM'...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19046 Yeah, if we are releasing and then reacquiring right away, over and over again, that would be bad, but I don't know when we would do that, so more details would be great if possible.
[GitHub] spark issue #18253: [SPARK-18838][CORE] Introduce multiple queues in LiveLis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18253 @bOOm-X I pushed some code to my repo: https://github.com/vanzin/spark/tree/SPARK-18838 It is an attempt to do things the way I've been trying to explain: it keeps changes as local as possible to `LiveListenerBus`, creating a few types that implement the grouping and asynchronous behavior. You could do filtering by extending the new `AsyncListener`, for example, and adding it to the live listener bus. It's just a p.o.c., so I cut a few corners (like metrics), and I only ran `SparkListenerSuite`, but I'm just trying to show a different approach that leaves the `ListenerBus` hierarchy mostly the same as now.
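To make the shape of that approach concrete, here is a condensed, illustrative sketch of an `AsyncListener`-style wrapper (not the actual code in that branch): events are enqueued on the bus thread and delivered from a private dispatch thread, so one slow listener group no longer blocks the rest of the bus.

```scala
import java.util.concurrent.LinkedBlockingQueue

import org.apache.spark.scheduler.SparkListenerEvent

class AsyncListenerSketch(handle: SparkListenerEvent => Unit) {
  private val queue = new LinkedBlockingQueue[SparkListenerEvent](10000)

  private val dispatcher = new Thread("async-listener-dispatch") {
    setDaemon(true)
    override def run(): Unit = {
      while (true) handle(queue.take()) // blocks until an event is available
    }
  }
  dispatcher.start()

  // Called on the bus thread; returns false (dropping the event) when full.
  def post(event: SparkListenerEvent): Boolean = queue.offer(event)
}
```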
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137406552

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -2346,6 +2347,45 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
     }
   }
 
+  test("insert overwrite directory") {
--- End diff --

moved.
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137406502

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ---
@@ -0,0 +1,78 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark.internal.io.FileCommitProtocol
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.catalog.BucketSpec
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.command.DataWritingCommand
+import org.apache.spark.sql.execution.datasources.FileFormatWriter
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc}
+
+// Base trait from which all hive insert statement physical execution extends.
+private[hive] trait SaveAsHiveFile extends DataWritingCommand {
+
+  protected def saveAsHiveFile(
+      sparkSession: SparkSession,
+      plan: SparkPlan,
+      hadoopConf: Configuration,
+      fileSinkConf: FileSinkDesc,
+      outputLocation: String,
+      partitionAttributes: Seq[Attribute] = Nil,
+      bucketSpec: Option[BucketSpec] = None,
+      options: Map[String, String] = Map.empty): Unit = {
--- End diff --

removed
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18975 **[Test build #81481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81481/testReport)** for PR 18975 at commit [`28fcb39`](https://github.com/apache/spark/commit/28fcb39028d93ec6ecea9eecf289c0e88b6c9ae6).
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18975 Looks pretty good! Will do the final pass after addressing the above comments. Thanks for your work!
[GitHub] spark pull request #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectori...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/18659#discussion_r137402913

--- Diff: python/pyspark/sql/functions.py ---
@@ -2112,7 +2113,7 @@ def wrapper(*args):
 
 @since(1.3)
-def udf(f=None, returnType=StringType()):
+def udf(f=None, returnType=StringType(), vectorized=False):
--- End diff --

@felixcheung does this fit your idea for a more generic decorator? Not exclusively labeled as `pandas_udf`, just enable vectorization with a flag, e.g. `@udf(DoubleType(), vectorized=True)`
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137403025

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -2346,6 +2347,45 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
     }
   }
 
+  test("insert overwrite directory") {
--- End diff --

DDLSuite.scala is becoming bigger and bigger. Move these two data source only test cases to `org.apache.spark.sql.sources.InsertSuite`?
[GitHub] spark issue #19046: [SPARK-18769][yarn] Limit resource requests based on RM'...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19046 I was told that the preemption issue was fixed in YARN (that's YARN-6210), so I don't think there's a need for this code currently (just use a fixed YARN if you want reliable preemption). Wilfred is on vacation so I can't ask him for details, but:

> It turns out that the AM is releasing and then acquiring the reservations again and again until it has enough to run all tasks that it needs.

That could be an issue (don't remember details of the Spark allocator code), but at the same time it sounds like a different issue than the one this was trying to fix.
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137401845

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ---
@@ -0,0 +1,78 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark.internal.io.FileCommitProtocol
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.catalog.BucketSpec
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.command.DataWritingCommand
+import org.apache.spark.sql.execution.datasources.FileFormatWriter
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc}
+
+// Base trait from which all hive insert statement physical execution extends.
+private[hive] trait SaveAsHiveFile extends DataWritingCommand {
+
+  protected def saveAsHiveFile(
+      sparkSession: SparkSession,
+      plan: SparkPlan,
+      hadoopConf: Configuration,
+      fileSinkConf: FileSinkDesc,
+      outputLocation: String,
+      partitionAttributes: Seq[Attribute] = Nil,
+      bucketSpec: Option[BucketSpec] = None,
+      options: Map[String, String] = Map.empty): Unit = {
--- End diff --

remove `bucketSpec` and `options`? Add them back only when we need it?
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81480/testReport)** for PR 18659 at commit [`4f6c950`](https://github.com/apache/spark/commit/4f6c95092066ee31a670ca827fbb892ac66df870).
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137401720

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ---
@@ -0,0 +1,78 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark.internal.io.FileCommitProtocol
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.catalog.BucketSpec
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.command.DataWritingCommand
+import org.apache.spark.sql.execution.datasources.FileFormatWriter
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc}
+
+// Base trait from which all hive insert statement physical execution extends.
+private[hive] trait SaveAsHiveFile extends DataWritingCommand {
+
+  protected def saveAsHiveFile(
+      sparkSession: SparkSession,
+      plan: SparkPlan,
+      hadoopConf: Configuration,
+      fileSinkConf: FileSinkDesc,
+      outputLocation: String,
+      partitionAttributes: Seq[Attribute] = Nil,
+      bucketSpec: Option[BucketSpec] = None,
+      options: Map[String, String] = Map.empty): Unit = {
+
+    val sessionState = sparkSession.sessionState
--- End diff --

removed
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137401187

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -234,12 +82,8 @@ case class InsertIntoHiveTable(
   override def run(sparkSession: SparkSession, children: Seq[SparkPlan]): Seq[Row] = {
     assert(children.length == 1)
-    val sessionState = sparkSession.sessionState
     val externalCatalog = sparkSession.sharedState.externalCatalog
-    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version
-    val hadoopConf = sessionState.newHadoopConf()
-    val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
--- End diff --

ok. added it back.
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137401217

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala ---
@@ -0,0 +1,78 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark.internal.io.FileCommitProtocol
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.catalog.BucketSpec
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.command.DataWritingCommand
+import org.apache.spark.sql.execution.datasources.FileFormatWriter
+import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc}
+
+// Base trait from which all hive insert statement physical execution extends.
+private[hive] trait SaveAsHiveFile extends DataWritingCommand {
+
+  protected def saveAsHiveFile(
+      sparkSession: SparkSession,
+      plan: SparkPlan,
+      hadoopConf: Configuration,
+      fileSinkConf: FileSinkDesc,
+      outputLocation: String,
+      partitionAttributes: Seq[Attribute] = Nil,
+      bucketSpec: Option[BucketSpec] = None,
+      options: Map[String, String] = Map.empty): Unit = {
+
+    val sessionState = sparkSession.sessionState
--- End diff --

remove this?
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81479/testReport)** for PR 18659 at commit [`fdea603`](https://github.com/apache/spark/commit/fdea603ae0ac6a8c27ec8161920f8c77549784e8).
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19149 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81476/ Test FAILed.
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19149 Merged build finished. Test FAILed.
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19149 **[Test build #81476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81476/testReport)** for PR 19149 at commit [`6fc7140`](https://github.com/apache/spark/commit/6fc714051bed53e01941c2d661ce955137534c95).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137399745

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -234,12 +82,8 @@ case class InsertIntoHiveTable(
   override def run(sparkSession: SparkSession, children: Seq[SparkPlan]): Seq[Row] = {
     assert(children.length == 1)
-    val sessionState = sparkSession.sessionState
     val externalCatalog = sparkSession.sharedState.externalCatalog
-    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version
-    val hadoopConf = sessionState.newHadoopConf()
-    val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
--- End diff --

How about keeping `stagingDir`?
[GitHub] spark issue #19046: [SPARK-18769][yarn] Limit resource requests based on RM'...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19046 It might help if you can give an exact scenario where you see the issue, and perhaps configs if those matter: do you have some set of minimum containers, a small idle timeout, etc.?
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137399408

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTmpPath.scala ---
@@ -0,0 +1,204 @@
+/* (Apache License 2.0 header omitted) */
+
+package org.apache.spark.sql.hive.execution
+
+import java.io.{File, IOException}
+import java.net.URI
+import java.text.SimpleDateFormat
+import java.util.{Date, Locale, Random}
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.hive.common.FileUtils
+import org.apache.hadoop.hive.ql.exec.TaskRunner
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.command.RunnableCommand
+import org.apache.spark.sql.hive.HiveExternalCatalog
+import org.apache.spark.sql.hive.client.HiveVersion
+
+// Base trait for getting a temporary location for writing data
+private[hive] trait HiveTmpPath extends RunnableCommand {
+
+  var createdTempDir: Option[Path] = None
+
+  private var stagingDir: String = ""
+
+  def getExternalTmpPath(
+      sparkSession: SparkSession,
+      hadoopConf: Configuration,
+      path: Path): Path = {
+    import org.apache.spark.sql.hive.client.hive._
+
+    // Before Hive 1.1, when inserting into a table, Hive will create the staging directory under
+    // a common scratch directory. After the writing is finished, Hive will simply empty the table
+    // directory and move the staging directory to it.
+    // After Hive 1.1, Hive will create the staging directory under the table directory, and when
+    // moving staging directory to table directory, Hive will still empty the table directory, but
+    // will exclude the staging directory there.
+    // We have to follow the Hive behavior here, to avoid troubles. For example, if we create
+    // staging directory under the table director for Hive prior to 1.1, the staging directory will
+    // be removed by Hive when Hive is trying to empty the table directory.
+    val hiveVersionsUsingOldExternalTempPath: Set[HiveVersion] = Set(v12, v13, v14, v1_0)
+    val hiveVersionsUsingNewExternalTempPath: Set[HiveVersion] = Set(v1_1, v1_2, v2_0, v2_1)
+
+    // Ensure all the supported versions are considered here.
+    assert(hiveVersionsUsingNewExternalTempPath ++ hiveVersionsUsingOldExternalTempPath ==
+      allSupportedHiveVersions)
+
+    val externalCatalog = sparkSession.sharedState.externalCatalog
+    val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version
+    stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
+    val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+
+    if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
+      oldVersionExternalTempPath(path, hadoopConf, scratchDir)
+    } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
+      newVersionExternalTempPath(path, hadoopConf, stagingDir)
+    } else {
+      throw new IllegalStateException("Unsupported hive version: " + hiveVersion.fullVersion)
+    }
+  }
+
+  def deleteExternalTmpPath(hadoopConf: Configuration): Unit = {
+    // Attempt to delete the staging directory and the inclusive files. If failed, the files are
+    // expected to be dropped at the normal termination of VM since deleteOnExit is used.
+    try {
+      createdTempDir.foreach { path =>
+        val fs = path.getFileSystem(hadoopConf)
+        if (fs.delete(path, true)) {
+          // If we successfully delete the staging directory, remove it from FileSystem's cache.
+          fs.cancelDeleteOnExit(path)
+        }
+      }
+    } catch {
+      case NonFatal(e) =>
+        logWarning(s"Unable to delete staging directory: $stagi
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19140 LGTM pending tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137398964 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTmpPath.scala --- @@ -0,0 +1,204 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import java.io.{File, IOException} +import java.net.URI +import java.text.SimpleDateFormat +import java.util.{Date, Locale, Random} + +import scala.util.control.NonFatal + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.hive.common.FileUtils +import org.apache.hadoop.hive.ql.exec.TaskRunner + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.execution.command.RunnableCommand +import org.apache.spark.sql.hive.HiveExternalCatalog +import org.apache.spark.sql.hive.client.HiveVersion + +// Base trait for getting a temporary location for writing data +private[hive] trait HiveTmpPath extends RunnableCommand { + + var createdTempDir: Option[Path] = None + + private var stagingDir: String = "" + + def getExternalTmpPath( + sparkSession: SparkSession, + hadoopConf: Configuration, + path: Path): Path = { +import org.apache.spark.sql.hive.client.hive._ + +// Before Hive 1.1, when inserting into a table, Hive will create the staging directory under +// a common scratch directory. After the writing is finished, Hive will simply empty the table +// directory and move the staging directory to it. +// After Hive 1.1, Hive will create the staging directory under the table directory, and when +// moving staging directory to table directory, Hive will still empty the table directory, but +// will exclude the staging directory there. +// We have to follow the Hive behavior here, to avoid troubles. For example, if we create +// staging directory under the table directory for Hive prior to 1.1, the staging directory will +// be removed by Hive when Hive is trying to empty the table directory. +val hiveVersionsUsingOldExternalTempPath: Set[HiveVersion] = Set(v12, v13, v14, v1_0) +val hiveVersionsUsingNewExternalTempPath: Set[HiveVersion] = Set(v1_1, v1_2, v2_0, v2_1) + +// Ensure all the supported versions are considered here.
+assert(hiveVersionsUsingNewExternalTempPath ++ hiveVersionsUsingOldExternalTempPath == + allSupportedHiveVersions) + +val externalCatalog = sparkSession.sharedState.externalCatalog +val hiveVersion = externalCatalog.asInstanceOf[HiveExternalCatalog].client.version +stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging") +val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive") + +if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) { + oldVersionExternalTempPath(path, hadoopConf, scratchDir) +} else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) { + newVersionExternalTempPath(path, hadoopConf, stagingDir) +} else { + throw new IllegalStateException("Unsupported hive version: " + hiveVersion.fullVersion) +} + } + + def deleteExternalTmpPath(hadoopConf : Configuration) : Unit = { +// Attempt to delete the staging directory and the inclusive files. If failed, the files are +// expected to be dropped at the normal termination of VM since deleteOnExit is used. +try { + createdTempDir.foreach { path => +val fs = path.getFileSystem(hadoopConf) +if (fs.delete(path, true)) { + // If we successfully delete the staging directory, remove it from FileSystem's cache. + fs.cancelDeleteOnExit(path) +} + } +} catch { + case NonFatal(e) => +logWarning(s"Unable to delete staging directory: $stagingDir")
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137398917 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -360,6 +360,27 @@ case class InsertIntoTable( } /** + * Insert query result into a directory. --- End diff -- added --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137398869 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -360,6 +360,27 @@ case class InsertIntoTable( } /** + * Insert query result into a directory. + * + * @param isLocal Indicates whether the specified directory is local directory + * @param storage Info about output file, row and what serialization format + * @param provider Specifies what data source to use; only used for data source file. + * @param child The query to be executed + * @param overwrite If true, the existing directory will be overwritten + */ +case class InsertIntoDir( +isLocal: Boolean, +storage: CatalogStorageFormat, +provider: Option[String], +child: LogicalPlan, +overwrite: Boolean = true) + extends LogicalPlan { + + override def children: Seq[LogicalPlan] = child :: Nil + override def output: Seq[Attribute] = Seq.empty --- End diff -- updated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137398776 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTmpPath.scala --- @@ -0,0 +1,204 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import java.io.{File, IOException} +import java.net.URI +import java.text.SimpleDateFormat +import java.util.{Date, Locale, Random} + +import scala.util.control.NonFatal + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.hive.common.FileUtils +import org.apache.hadoop.hive.ql.exec.TaskRunner + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.execution.command.RunnableCommand +import org.apache.spark.sql.hive.HiveExternalCatalog +import org.apache.spark.sql.hive.client.HiveVersion + +// Base trait for getting a temporary location for writing data +private[hive] trait HiveTmpPath extends RunnableCommand { --- End diff -- removed RunnableCommand; the classes that extend HiveTmpPath already extend RunnableCommand --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
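A minimal, self-contained sketch of the inheritance shape described above, using toy stand-in names (only `HiveTmpPath` and `RunnableCommand` come from this thread; everything else is invented for illustration):

```scala
// Toy stand-ins for Spark's RunnableCommand and the HiveTmpPath trait.
trait RunnableCommandLike {
  def run(): Seq[String]
}

// After the change: a plain helper trait that only knows how to derive a
// temporary staging location, with no command behavior of its own.
trait HiveTmpPathLike {
  def externalTmpPath(tableDir: String): String = s"$tableDir/.hive-staging"
}

// The concrete command mixes in both traits, so the helper trait itself no
// longer needs to extend the command trait.
case class InsertIntoHiveDirLike(tableDir: String)
  extends RunnableCommandLike with HiveTmpPathLike {
  override def run(): Seq[String] = Seq(s"staging under ${externalTmpPath(tableDir)}")
}
```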
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137398299 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,11 +179,60 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the logical plan. + * Parameters used for writing query to a table: + * (tableIdentifier, partitionKeys, overwrite, exists). + */ + type InsertTableParams = (TableIdentifier, Map[String, Option[String]], Boolean, Boolean) + + /** + * Parameters used for writing query to a directory: (isLocal, CatalogStorageFormat, provider). + */ + type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String]) + + /** + * Add an + * INSERT INTO [TABLE] or + * INSERT OVERWRITE TABLE or + * INSERT OVERWRITE [LOCAL] DIRECTORY + * operation to logical plan */ private def withInsertInto( ctx: InsertIntoContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) { +ctx match { + case table : InsertIntoTableContext => --- End diff -- updated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19149 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19149 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81475/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137397811 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,11 +179,60 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the logical plan. + * Parameters used for writing query to a table: + * (tableIdentifier, partitionKeys, overwrite, exists). + */ + type InsertTableParams = (TableIdentifier, Map[String, Option[String]], Boolean, Boolean) + + /** + * Parameters used for writing query to a directory: (isLocal, CatalogStorageFormat, provider). + */ + type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String]) + + /** + * Add an + * INSERT INTO [TABLE] or + * INSERT OVERWRITE TABLE or + * INSERT OVERWRITE [LOCAL] DIRECTORY --- End diff -- added --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19149 **[Test build #81475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81475/testReport)** for PR 19149 at commit [`0be3b67`](https://github.com/apache/spark/commit/0be3b670713e3de5951b0c3d7abb53f2cbe9d96a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19046: [SPARK-18769][yarn] Limit resource requests based on RM'...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19046 Unfortunately that doesn't make the cause clear to me, or what the issue is on the YARN side. I'm not sure what he means by "AM is releasing and then acquiring the reservations again and again until it has enough to run all tasks that it needs". Also, do you know whether "Due to the sustained backlog trigger doubling the request size each time it floods the scheduler with requests." is talking about the exponential increase in the container requests that Spark does? I.e., if we did all the requests at once up front, would it be better? I would also like to know what issue this causes in YARN. His last statement is also not always true: the MR AM only takes headroom into account with slow start and when preempting reducers to run maps. It may very well be that you are using slow start, and if you have it configured very aggressively (meaning reducers start very early) the requests could be spread out quite a bit, but you are also possibly wasting resources for the tradeoff of maybe finishing sooner, depending on the job. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81478/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81478/testReport)** for PR 18659 at commit [`3efa7f2`](https://github.com/apache/spark/commit/3efa7f2cd9ea222ff27f9d39420c1c2c256d4fc2). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19135: [SPARK-21923][CORE]Avoid call reserveUnrollMemoryForThis...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19135 It would be great to test the performance on executors with various coresPerExecutor settings, to ensure we don't introduce regressions with this code change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18659: [SPARK-21404][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81478/testReport)** for PR 18659 at commit [`3efa7f2`](https://github.com/apache/spark/commit/3efa7f2cd9ea222ff27f9d39420c1c2c256d4fc2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19046: [SPARK-18769][yarn] Limit resource requests based on RM'...
Github user Tagar commented on the issue: https://github.com/apache/spark/pull/19046 @tgravescs, here's a quote from Wilfred Spiegelenburg - hope it answers both of your questions. > The behaviour I discussed earlier around the Spark AM reservations is not optimal. It turns out that the AM is releasing and then acquiring the reservations again and again until it has enough to run all tasks that it needs. This seems to be triggered by getting a container assigned to the application by the scheduler. Due to the sustained backlog trigger doubling the request size each time it floods the scheduler with requests. This issue will be logged as an internal jira at first. The next steps will be to discuss that behaviour with the Spark team with the goal of making it behave better on the cluster. The MR AM does behave better in this respect as it takes into account the available resources for the application via what is called "headroom". The Spark AM does not do this. thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
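A toy model of the behavior that quote describes (invented for illustration; this is not Spark's actual `YarnAllocator` code): with a sustained backlog the request target doubles each round, and capping it by the RM-reported headroom, as this PR proposes, stops the flood of requests the scheduler can only reserve and release again.

```scala
// Hypothetical helper: one round of the sustained-backlog doubling heuristic,
// with the proposed headroom cap applied last.
def nextRequestTarget(current: Int, stillNeeded: Int, headroom: Int): Int = {
  val doubled = math.max(1, current * 2)       // sustained-backlog doubling
  val wanted  = math.min(doubled, stillNeeded) // never ask beyond what the job needs
  math.min(wanted, headroom)                   // this PR's idea: cap by RM headroom
}

// Without the final cap, the target ramps 1, 2, 4, 8, ... regardless of what
// the RM can actually grant.
```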
[GitHub] spark issue #19046: [SPARK-18769][yarn] Limit resource requests based on RM'...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19046 @Tagar can you be more specific about the problems you are seeing? How does this affect preemption? Why don't you see the same issues on MapReduce/Tez? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137390245 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -201,13 +201,14 @@ case class AlterTableAddColumnsCommand( // make sure any partition columns are at the end of the fields val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema +val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray) SchemaUtils.checkColumnNameDuplication( reorderedSchema.map(_.name), "in the table definition of " + table.identifier, conf.caseSensitiveAnalysis) +DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema)) --- End diff -- It's okay. Inside `checkDataSchemaFieldNames`, we only use `table.dataSchema`, like the following. ``` ParquetSchemaConverter.checkFieldNames(table.dataSchema) ``` For partition columns, we have been allowing special characters. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
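For readers skimming the thread, a rough approximation of what a data-schema-only field-name check does (this sketch is not the real `ParquetSchemaConverter.checkFieldNames` source; the invalid-character set mirrors Parquet's, and the real code raises `AnalysisException`):

```scala
import org.apache.spark.sql.types.StructType

// Only the data schema is passed in, so partition columns (which may contain
// special characters) are never validated here.
def checkDataSchemaFieldNames(dataSchema: StructType): Unit = {
  dataSchema.fieldNames.foreach { name =>
    if (name.exists(c => " ,;{}()\n\t=".contains(c))) {
      throw new IllegalArgumentException(
        s"""Attribute name "$name" contains invalid character(s).""")
    }
  }
}
```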
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137388410 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -360,6 +360,27 @@ case class InsertIntoTable( } /** + * Insert query result into a directory. --- End diff -- Please add the comment like > Note that this plan is unresolved and has to be replaced by the concrete implementations during analysis. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137388263 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -360,6 +360,27 @@ case class InsertIntoTable( } /** + * Insert query result into a directory. + * + * @param isLocal Indicates whether the specified directory is local directory + * @param storage Info about output file, row and what serialization format + * @param provider Specifies what data source to use; only used for data source file. + * @param child The query to be executed + * @param overwrite If true, the existing directory will be overwritten + */ +case class InsertIntoDir( +isLocal: Boolean, +storage: CatalogStorageFormat, +provider: Option[String], +child: LogicalPlan, +overwrite: Boolean = true) + extends LogicalPlan { + + override def children: Seq[LogicalPlan] = child :: Nil + override def output: Seq[Attribute] = Seq.empty --- End diff -- Set `override lazy val resolved: Boolean = false` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
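Applying this suggestion (together with the doc-comment note requested in the sibling thread) to the quoted class would look roughly like the following; this mirrors the diff above rather than the final merged code:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

/**
 * Insert query result into a directory.
 *
 * Note that this plan is unresolved and has to be replaced by the concrete
 * implementations during analysis.
 */
case class InsertIntoDir(
    isLocal: Boolean,
    storage: CatalogStorageFormat,
    provider: Option[String],
    child: LogicalPlan,
    overwrite: Boolean = true)
  extends LogicalPlan {

  override def children: Seq[LogicalPlan] = child :: Nil
  override def output: Seq[Attribute] = Seq.empty
  // The suggested change: a placeholder plan should never report itself resolved,
  // forcing the analyzer to replace it before execution.
  override lazy val resolved: Boolean = false
}
```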
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137387725 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTmpPath.scala --- @@ -0,0 +1,204 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import java.io.{File, IOException} +import java.net.URI +import java.text.SimpleDateFormat +import java.util.{Date, Locale, Random} + +import scala.util.control.NonFatal + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.hive.common.FileUtils +import org.apache.hadoop.hive.ql.exec.TaskRunner + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.execution.command.RunnableCommand +import org.apache.spark.sql.hive.HiveExternalCatalog +import org.apache.spark.sql.hive.client.HiveVersion + +// Base trait for getting a temporary location for writing data +private[hive] trait HiveTmpPath extends RunnableCommand { --- End diff -- No need to extend `RunnableCommand` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137387842 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTmpPath.scala --- @@ -0,0 +1,204 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import java.io.{File, IOException} +import java.net.URI +import java.text.SimpleDateFormat +import java.util.{Date, Locale, Random} + +import scala.util.control.NonFatal + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.hive.common.FileUtils +import org.apache.hadoop.hive.ql.exec.TaskRunner + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.execution.command.RunnableCommand +import org.apache.spark.sql.hive.HiveExternalCatalog +import org.apache.spark.sql.hive.client.HiveVersion + +// Base trait for getting a temporary location for writing data +private[hive] trait HiveTmpPath extends RunnableCommand { --- End diff -- Instead, we should extend it in the classes that extend HiveTmpPath --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137386317 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,11 +179,60 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the logical plan. + * Parameters used for writing query to a table: + * (tableIdentifier, partitionKeys, overwrite, exists). + */ + type InsertTableParams = (TableIdentifier, Map[String, Option[String]], Boolean, Boolean) + + /** + * Parameters used for writing query to a directory: (isLocal, CatalogStorageFormat, provider). + */ + type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String]) + + /** + * Add an + * INSERT INTO [TABLE] or + * INSERT OVERWRITE TABLE or + * INSERT OVERWRITE [LOCAL] DIRECTORY + * operation to logical plan */ private def withInsertInto( ctx: InsertIntoContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) { +ctx match { + case table : InsertIntoTableContext => --- End diff -- Nit: `table :`-> `table:` The same issue in line 206 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r137385849 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,11 +179,60 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the logical plan. + * Parameters used for writing query to a table: + * (tableIdentifier, partitionKeys, overwrite, exists). + */ + type InsertTableParams = (TableIdentifier, Map[String, Option[String]], Boolean, Boolean) + + /** + * Parameters used for writing query to a directory: (isLocal, CatalogStorageFormat, provider). + */ + type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String]) + + /** + * Add an + * INSERT INTO [TABLE] or + * INSERT OVERWRITE TABLE or + * INSERT OVERWRITE [LOCAL] DIRECTORY --- End diff -- Let us put the complete syntax here. ``` {{{ INSERT OVERWRITE TABLE tableIdentifier [partitionSpec [IF NOT EXISTS]]? INSERT INTO [TABLE] tableIdentifier [partitionSpec] INSERT OVERWRITE [LOCAL] DIRECTORY path=STRING [rowFormat] [createFileFormat] INSERT OVERWRITE [LOCAL] DIRECTORY [path=STRING] tableProvider [OPTIONS options=tablePropertyList] }}} ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
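For reference, the four grammar forms above correspond to statements like these (table names and paths are made up; assumes a `SparkSession` named `spark`):

```scala
// INSERT OVERWRITE TABLE with a partition spec
spark.sql("INSERT OVERWRITE TABLE dst PARTITION (dt = '2017-09-06') SELECT * FROM src")

// INSERT INTO [TABLE]
spark.sql("INSERT INTO TABLE dst SELECT * FROM src")

// INSERT OVERWRITE [LOCAL] DIRECTORY with a Hive row format
spark.sql(
  """INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_out'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |SELECT * FROM src""".stripMargin)

// INSERT OVERWRITE DIRECTORY with a data source provider
spark.sql("INSERT OVERWRITE DIRECTORY '/tmp/parquet_out' USING parquet SELECT * FROM src")
```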
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18975 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81468/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18975 **[Test build #81468 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81468/testReport)** for PR 18975 at commit [`b461e00`](https://github.com/apache/spark/commit/b461e00e425660e33fdbc24a75884a2a2e2da4b8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19086 After more discussion with others, this behavior change really bothers us. Neither change sounds good to us. Where does the requirement come from? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19124: [SPARK-21912][SQL] ORC/Parquet table should not c...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r137383267 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -201,13 +201,14 @@ case class AlterTableAddColumnsCommand( // make sure any partition columns are at the end of the fields val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema +val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray) SchemaUtils.checkColumnNameDuplication( reorderedSchema.map(_.name), "in the table definition of " + table.identifier, conf.caseSensitiveAnalysis) +DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema)) --- End diff -- `newSchema ` also contains `partition schema`. How about partition schema? Do we have the same limits on it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19140 **[Test build #81477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81477/testReport)** for PR 19140 at commit [`98f0ff2`](https://github.com/apache/spark/commit/98f0ff2a655c398e5b502ce2b340dfac88b385e9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19083: [SPARK-21871][SQL] Check actual bytecode size whe...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19083#discussion_r137378351 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -98,19 +99,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { } test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") { -var strExpr: Expression = Literal("abc") -for (_ <- 1 to 150) { - strExpr = Decode(Encode(strExpr, "utf-8"), "utf-8") -} +// Set the max value at `WHOLESTAGE_HUGE_METHOD_LIMIT` to compile gen'd code by janino +withSQLConf(SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT.key -> Int.MaxValue.toString) { --- End diff -- Why do we need to change the value of `WHOLESTAGE_HUGE_METHOD_LIMIT` when this test is not for whole-stage codegen? Could we choose a better name for `SQLConf.WHOLESTAGE_HUGE_METHOD_LIMIT`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17435 Gentle ping (going through old PySpark PRs) :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19149 **[Test build #81476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81476/testReport)** for PR 19149 at commit [`6fc7140`](https://github.com/apache/spark/commit/6fc714051bed53e01941c2d661ce955137534c95). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule confliction betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19149 **[Test build #81475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81475/testReport)** for PR 19149 at commit [`0be3b67`](https://github.com/apache/spark/commit/0be3b670713e3de5951b0c3d7abb53f2cbe9d96a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19140 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19140 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81474/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19140: [SPARK-21890] Credentials not being passed to add the to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19140 **[Test build #81474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81474/testReport)** for PR 19140 at commit [`1184a73`](https://github.com/apache/spark/commit/1184a735124d6119f515257a75b0721efc3adff9). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org