[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-19 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r635099518



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +99,19 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite && !SQLConf.get.allowNonEmptyLocationInCTAS) {

Review comment:
   For SQL CTAS, the mode is always `ErrorIfExists`, and Spark only checks whether the table exists or not; it was not checking for a non-empty data path.
   In that case, I'll change the condition to `if (saveMode == SaveMode.ErrorIfExists && !SQLConf.get.allowNonEmptyLocationInCTAS)`.

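A minimal, self-contained sketch of the revised guard. The `SaveMode` ADT and the boolean flag below are stand-ins for Spark's real `SaveMode` and `SQLConf.get.allowNonEmptyLocationInCTAS`, not the actual classes:

```scala
// Sketch only: simulates the proposed condition change with stubbed types.
object CtasGuardSketch {
  sealed trait SaveMode
  object SaveMode {
    case object ErrorIfExists extends SaveMode
    case object Overwrite extends SaveMode
  }

  // Stand-in for the check in assertEmptyRootPath: the path-emptiness
  // validation should run only for ErrorIfExists when the legacy flag is off.
  def shouldCheckNonEmptyPath(saveMode: SaveMode, allowNonEmptyLocationInCTAS: Boolean): Boolean =
    saveMode == SaveMode.ErrorIfExists && !allowNonEmptyLocationInCTAS

  def main(args: Array[String]): Unit = {
    // SQL CTAS always runs with ErrorIfExists, so the check fires by default.
    assert(shouldCheckNonEmptyPath(SaveMode.ErrorIfExists, allowNonEmptyLocationInCTAS = false))
    // Overwrite and the legacy opt-out both skip the check.
    assert(!shouldCheckNonEmptyPath(SaveMode.Overwrite, allowNonEmptyLocationInCTAS = false))
    assert(!shouldCheckNonEmptyPath(SaveMode.ErrorIfExists, allowNonEmptyLocationInCTAS = true))
    println("ok")
  }
}
```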



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-19 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r635076302



##
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala
##
@@ -235,10 +235,10 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSparkSession {
   }
 
   test("create table using as select - with overridden max number of buckets") {
-    def createTableSql(numBuckets: Int): String =
+    def createTableSql(tablePath: String, numBuckets: Int): String =
       s"""
          |CREATE TABLE t USING PARQUET
-        |OPTIONS (PATH '${path.toURI}')
+        |OPTIONS (PATH '$tablePath')

Review comment:
   Yes, in this test case the function `createTableSql` will be called twice inside `Seq(11, maxNrBuckets).foreach(...)`. As this PR doesn't allow CTAS with a non-empty location, the second call to `createTableSql` would fail if the same `path` were reused. That is why a new table location is created for each iteration.
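The reuse problem can be sketched with a toy emptiness check over temp directories (`createTableAt` is a hypothetical stand-in for the CTAS validation, not Spark's code):

```scala
import java.nio.file.{Files, Path}

object PerIterationPathSketch {
  // Hypothetical stand-in for CTAS: refuse a non-empty target dir, then "write" a file.
  def createTableAt(dir: Path): Unit = {
    if (Files.exists(dir) && Files.list(dir).count() > 0)
      throw new IllegalStateException(s"non-empty location: $dir")
    Files.createDirectories(dir)
    Files.createFile(dir.resolve("part-00000"))
  }

  def main(args: Array[String]): Unit = {
    val base = Files.createTempDirectory("ctas")
    // Reusing one path: the second call sees leftover data and fails.
    val shared = base.resolve("shared")
    createTableAt(shared)
    val secondCallFailed =
      try { createTableAt(shared); false }
      catch { case _: IllegalStateException => true }
    assert(secondCallFailed)
    // A fresh location per iteration, as the updated test does: every call succeeds.
    Seq(11, 13).foreach(n => createTableAt(base.resolve(s"buckets-$n")))
    println("ok")
  }
}
```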




For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632289902



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +99,20 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite &&
+      !SQLConf.get.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(hadoopConf)
+      if (fs.exists(filePath) &&
+        fs.getFileStatus(filePath).isDirectory &&
+        fs.listStatus(filePath).length != 0) {
+        throw new AnalysisException(
+          s"CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory " +
+            s"${tablePath} . To disable the behavior, " +
+            s"set 'spark.sql.legacy.allowNonEmptyLocationInCTAS to true.")

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632289297



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -598,6 +598,37 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
 }
   }
 
+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION '$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"""CREATE TABLE ctas_with_existing_location stored as rcfile LOCATION
+           |'$tempLocation' AS SELECT key k, value FROM src ORDER BY k, value""".stripMargin)
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>
+      Seq("false", "true").foreach { allowNonEmptyDirFlag =>
+        withSQLConf(
+          SQLConf.CONVERT_CTAS.key -> convertCTASFlag,
+          SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS.key -> allowNonEmptyDirFlag) {
+          withTempDir { dir =>
+            val tempLocation = dir.toURI.toString
+            withTable("ctas1", "ctas_with_existing_location") {
+              if (!spark.conf.get(SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS)) {

Review comment:
   Done, changed the type of val `allowNonEmptyDirFlag` to `Boolean`







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632288726



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -598,6 +598,37 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
 }
   }
 
+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION '$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"""CREATE TABLE ctas_with_existing_location stored as rcfile LOCATION
+           |'$tempLocation' AS SELECT key k, value FROM src ORDER BY k, value""".stripMargin)
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632288384



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +99,20 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite &&
+      !SQLConf.get.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(hadoopConf)
+      if (fs.exists(filePath) &&
+        fs.getFileStatus(filePath).isDirectory &&

Review comment:
   Done

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +99,20 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite &&
+      !SQLConf.get.allowCreateTableAsSelectNonEmptyLocation) {

Review comment:
   Done

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +99,20 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite &&
+      !SQLConf.get.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(hadoopConf)
+      if (fs.exists(filePath) &&
+        fs.getFileStatus(filePath).isDirectory &&
+        fs.listStatus(filePath).length != 0) {
+        throw new AnalysisException(
+          s"CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory " +
+            s"${tablePath} . To disable the behavior, " +

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632288269



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -598,6 +598,37 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
 }
   }
 
+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION '$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"""CREATE TABLE ctas_with_existing_location stored as rcfile LOCATION
+           |'$tempLocation' AS SELECT key k, value FROM src ORDER BY k, value""".stripMargin)
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>
+      Seq("false", "true").foreach { allowNonEmptyDirFlag =>
+        withSQLConf(
+          SQLConf.CONVERT_CTAS.key -> convertCTASFlag,
+          SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS.key -> allowNonEmptyDirFlag) {
+          withTempDir { dir =>
+            val tempLocation = dir.toURI.toString
+            withTable("ctas1", "ctas_with_existing_location") {
+              if (!spark.conf.get(SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS)) {
+                intercept[AnalysisException] {

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631758038



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -598,6 +598,38 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
 }
   }
 
+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION " +
+        s"'file:$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"CREATE TABLE ctas_with_existing_location stored as rcfile " +
+        s"LOCATION 'file:$tempLocation' " +
+        s"AS SELECT key k, value FROM src ORDER BY k, value")
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>

Review comment:
   `withSQLConf` accepts pairs of `(String, String)`; passing `(String, Boolean)` will fail to compile.
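A sketch of the constraint; the `withSQLConf` below is a simplified stand-in for the real helper in `SQLTestUtils`, which likewise takes `(String, String)` pairs:

```scala
object WithSqlConfPairsSketch {
  // Simplified stand-in: the real SQLTestUtils.withSQLConf also takes (String, String)* pairs.
  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = f

  def main(args: Array[String]): Unit = {
    var runs = 0
    // Booleans must be rendered as strings before being passed as conf values.
    Seq(false, true).foreach { flag =>
      withSQLConf("spark.sql.legacy.allowNonEmptyLocationInCTAS" -> flag.toString) {
        runs += 1 // test body would go here
      }
    }
    // withSQLConf("key" -> true) { }  // would not compile: (String, Boolean)
    assert(runs == 2)
    println("ok")
  }
}
```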







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631706750



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -1604,6 +1604,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ALLOW_CTAS_NON_EMPTY_LOCATION =
+    buildConf("spark.sql.legacy.ctas.allowNonEmptyLocation")
+      .internal()
+      .doc("When false, Create Table As Location Select will throw Analysis Exception, " +
+        "if the location is non empty.")

Review comment:
   Done

##
File path: docs/sql-migration-guide.md
##
@@ -78,7 +78,9 @@ license: |
  - In Spark 3.2, the timestamps subtraction expression such as `timestamp '2021-03-31 23:48:00' - timestamp '2021-01-01 00:00:00'` returns values of `DayTimeIntervalType`. In Spark 3.1 and earlier, the type of the same expression is `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
 
  - In Spark 3.2, `CREATE TABLE .. LIKE ..` command can not use reserved properties. You need their specific clauses to specify them, for example, `CREATE TABLE test1 LIKE test LOCATION 'some path'`. You can set `spark.sql.legacy.notReserveProperties` to `true` to ignore the `ParseException`, in this case, these properties will be silently removed, for example: `TBLPROPERTIES('owner'='yao')` will have no effect. In Spark version 3.1 and below, the reserved properties can be used in `CREATE TABLE .. LIKE ..` command but have no side effects, for example, `TBLPROPERTIES('location'='/tmp')` does not change the location of the table but only create a headless property just like `'a'='b'`.
-
+ 
+  - In Spark 3.2, `CREATE TABLE AS SELECT` with non-empty `LOCATION` will throw `AnalysisException`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.ctas.allowNonEmptyLocation` to `true`.
+  

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631706649



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -1604,6 +1604,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ALLOW_CTAS_NON_EMPTY_LOCATION =
+    buildConf("spark.sql.legacy.ctas.allowNonEmptyLocation")

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631706428



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +98,23 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+      !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {

Review comment:
   removed null check







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631705687



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +98,23 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+      !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {

Review comment:
   combined the 3 if conditions into 1







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631705217



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +98,23 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+      !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {
+          val lStats = fs.listStatus(filePath)
+          if (lStats != null && lStats.length != 0) {

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-13 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631622569



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +98,23 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+      !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {
+          val lStats = fs.listStatus(filePath)
+          if (lStats != null && lStats.length != 0) {
+            throw new AnalysisException(
+              s"CREATE-TABLE-AS-SELECT cannot create table" +

Review comment:
   In Hive, 
https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java#L408
   the error message is `CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory.`
   To be consistent, shall we show this message?
   `CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory. To disable the behavior, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true`







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.

2021-05-12 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631550114



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -598,6 +598,38 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
 }
   }
 
+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION " +
+        s"'file:$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"CREATE TABLE ctas_with_existing_location stored as rcfile " +
+        s"LOCATION 'file:$tempLocation' " +
+        s"AS SELECT key k, value FROM src ORDER BY k, value")
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>
+      Seq("false", "true").foreach { allowNonEmptyDirFlag =>

Review comment:
   `withSQLConf` accepts pairs of `(String, String)`; passing `(String, Boolean)` will fail to compile.







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL]CTAS with LOCATION , should not allow to a non-empty directory.

2021-05-11 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r630301266



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -1604,6 +1604,15 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ALLOW_CTAS_NON_EMPTY_LOCATION =
+    buildConf("spark.sql.legacy.ctas.allowNonEmptyLocation")
+      .internal()
+      .doc("When false, Create Table As Location Select will throw Analysis Exception, " +
+        "if the location is non empty.")
+      .version("3.2")

Review comment:
   Done







[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL]CTAS with LOCATION , should not allow to a non-empty directory.

2021-05-11 Thread GitBox


vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r630278821



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala
##
@@ -96,4 +98,23 @@ object DataWritingCommand {
   sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
   metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+      !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {
+          val lStats = fs.listStatus(filePath)

Review comment:
   Yes, Hive does the same thing. [HIVE-11319]
   ref: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L13615
   The listStatus call will be slow when the location has data, but that is the failure condition as per this PR. So in the normal case, the location dir will be empty and listStatus won't be slow.
   Later, another JIRA, HIVE-15367, added one more level of check for a target location that contains only staging dirs.






