[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16837





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r100174684
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -892,21 +892,58 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
 if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
   val tablePath = new Path(tableMeta.location)
+  val fs = tablePath.getFileSystem(hadoopConf)
   val newParts = newSpecs.map { spec =>
+val rightPath = renamePartitionDirectory(fs, tablePath, partitionColumnNames, spec)
 val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
-val wrongPath = new Path(partition.location)
-val rightPath = ExternalCatalogUtils.generatePartitionPath(
-  spec, partitionColumnNames, tablePath)
+partition.copy(storage = partition.storage.copy(locationUri = Some(rightPath.toString)))
+  }
+  alterPartitions(db, table, newParts)
+}
+  }
+
+  /**
+   * Rename the partition directory w.r.t. the actual partition columns.
+   *
+   * It will recursively rename the partition directory from the first partition column, to be most
+   * compatible with different file systems. e.g. in some file systems, renaming `a=1/b=2` to
+   * `A=1/B=2` will result to `a=1/B=2`, while in some other file systems, the renaming works, but
+   * will leave an empty directory `a=1`.
+   */
+  private def renamePartitionDirectory(
+  fs: FileSystem,
+  tablePath: Path,
+  partCols: Seq[String],
+  newSpec: TablePartitionSpec): Path = {
+import ExternalCatalogUtils.getPartitionPathString
+
+var totalPath = tablePath
--- End diff --

How about `currFullPath` or `fullPath`?





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r100038960
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala ---
@@ -481,4 +484,30 @@ class PartitionProviderCompatibilitySuite
   assert(spark.sql("show partitions test").count() == 5)
 }
   }
+
+  test("SPARK-19359: renaming partition should not leave useless directories") {
+withTable("t", "t1") {
+  Seq((1, 2, 3)).toDF("id", "A", "B").write.partitionBy("A", "B").saveAsTable("t")
+  spark.sql("alter table t partition(A=2, B=3) rename to partition(A=4, B=5)")
+
+  var table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
+  var tablePath = new Path(table.location)
+  val fs = tablePath.getFileSystem(spark.sessionState.newHadoopConf())
+  // the `A=2` directory is still there, we follow this behavior from hive.
+  assert(fs.listStatus(tablePath)
+    .filterNot(_.getPath.toString.contains("A=2")).count(_.isDirectory) == 1)
+  assert(fs.listStatus(new Path(tablePath, "A=4")).count(_.isDirectory) == 1)
+
+
+  Seq((1, 2, 3, 4)).toDF("id", "A", "b", "C").write.partitionBy("A", "b", "C").saveAsTable("t1")
+  spark.sql("alter table t1 partition(A=2, b=3, C=4) rename to partition(A=4, b=5, C=6)")
+  table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t1"))
+  tablePath = new Path(table.location)
+  // the `A=2` directory is still there, we follow this behavior from hive.
+  assert(fs.listStatus(tablePath)
+    .filterNot(_.getPath.toString.contains("A=2")).count(_.isDirectory) == 1)
--- End diff --

I want to check the number of partition directories other than `A=2`.
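
To spell out the intended check (a hedged restatement; the two-step `partitionDirs` form below is illustrative, not part of the patch):

    // After `alter table t partition(A=2, B=3) rename to partition(A=4, B=5)`,
    // the table directory should contain the leftover `A=2` directory (kept to
    // match Hive's behavior) plus exactly one real partition directory, `A=4`.
    val partitionDirs = fs.listStatus(tablePath)
      .filter(_.isDirectory)
      .filterNot(_.getPath.toString.contains("A=2"))
    assert(partitionDirs.length == 1)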





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r18693
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala ---
@@ -481,4 +484,30 @@ class PartitionProviderCompatibilitySuite
   assert(spark.sql("show partitions test").count() == 5)
 }
   }
+
+  test("SPARK-19359: renaming partition should not leave useless directories") {
+withTable("t", "t1") {
+  Seq((1, 2, 3)).toDF("id", "A", "B").write.partitionBy("A", "B").saveAsTable("t")
+  spark.sql("alter table t partition(A=2, B=3) rename to partition(A=4, B=5)")
+
+  var table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
+  var tablePath = new Path(table.location)
+  val fs = tablePath.getFileSystem(spark.sessionState.newHadoopConf())
+  // the `A=2` directory is still there, we follow this behavior from hive.
+  assert(fs.listStatus(tablePath)
+    .filterNot(_.getPath.toString.contains("A=2")).count(_.isDirectory) == 1)
+  assert(fs.listStatus(new Path(tablePath, "A=4")).count(_.isDirectory) == 1)
+
+
+  Seq((1, 2, 3, 4)).toDF("id", "A", "b", "C").write.partitionBy("A", "b", "C").saveAsTable("t1")
+  spark.sql("alter table t1 partition(A=2, b=3, C=4) rename to partition(A=4, b=5, C=6)")
+  table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t1"))
+  tablePath = new Path(table.location)
+  // the `A=2` directory is still there, we follow this behavior from hive.
+  assert(fs.listStatus(tablePath)
+    .filterNot(_.getPath.toString.contains("A=2")).count(_.isDirectory) == 1)
--- End diff --

If the intent is to check that the `A=2` directory exists, shouldn't this be `filter`?





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r9328
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -892,21 +892,59 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
 if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
   val tablePath = new Path(tableMeta.location)
+  val fs = tablePath.getFileSystem(hadoopConf)
   val newParts = newSpecs.map { spec =>
+val rightPath = renamePartitionDirectory(fs, tablePath, partitionColumnNames, spec)
 val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
-val wrongPath = new Path(partition.location)
-val rightPath = ExternalCatalogUtils.generatePartitionPath(
-  spec, partitionColumnNames, tablePath)
+partition.copy(storage = partition.storage.copy(locationUri = Some(rightPath.toString)))
+  }
+  alterPartitions(db, table, newParts)
+}
+  }
+
+  /**
+   * Rename the partition directory w.r.t. the actual partition columns.
+   *
+   * It will recursively rename the partition directory from the first partition column, to be most
+   * compatible with different file systems. e.g. in some file systems, renaming `a=1/b=2` to
+   * `A=1/B=2` will result to `a=1/B=2`, while in some other file systems, the renaming works, but
+   * will leave an empty directory `a=1`.
+   */
+  private def renamePartitionDirectory(
+  fs: FileSystem,
+  tablePath: Path,
+  partCols: Seq[String],
+  newSpec: TablePartitionSpec): Path = {
+import ExternalCatalogUtils.getPartitionPathString
+
+var expectedPartitionPath = tablePath
+var actualPartitionPath = tablePath
+partCols.foreach { col =>
+  val partitionValue = newSpec(col)
+  val expectedPartitionPathString = getPartitionPathString(col, partitionValue)
+  val actualPartitionPathString = getPartitionPathString(col.toLowerCase, partitionValue)
+
+  expectedPartitionPath = new Path(expectedPartitionPath, expectedPartitionPathString)
+  actualPartitionPath = new Path(actualPartitionPath, actualPartitionPathString)
--- End diff --

oh yeah, we should always rename directories one nesting level at a time.
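
A minimal sketch of that one-level-at-a-time renaming (the method name, the simplified path construction, and the `Map[String, String]` spec are assumptions for illustration; the actual patch goes through `getPartitionPathString` and handles escaping and the default partition value):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: walk the partition columns from the outermost level. At each level
    // the mis-cased child and its correctly-cased target share the same, already
    // corrected, parent, so every fs.rename only moves one directory level.
    def renameOneLevelAtATime(
        fs: FileSystem,
        tablePath: Path,
        partCols: Seq[String],
        newSpec: Map[String, String]): Path = {
      var currentPath = tablePath
      partCols.foreach { col =>
        val expectedChild = new Path(currentPath, s"$col=${newSpec(col)}")
        val actualChild = new Path(currentPath, s"${col.toLowerCase}=${newSpec(col)}")
        // Only rename when the correctly-cased directory is missing; it may
        // already exist if a sibling partition was renamed earlier.
        if (!fs.exists(expectedChild) && fs.exists(actualChild)) {
          fs.rename(actualChild, expectedChild)
        }
        currentPath = expectedChild
      }
      currentPath
    }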





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r8413
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -892,21 +892,59 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
 if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
   val tablePath = new Path(tableMeta.location)
+  val fs = tablePath.getFileSystem(hadoopConf)
   val newParts = newSpecs.map { spec =>
+val rightPath = renamePartitionDirectory(fs, tablePath, partitionColumnNames, spec)
 val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
-val wrongPath = new Path(partition.location)
-val rightPath = ExternalCatalogUtils.generatePartitionPath(
-  spec, partitionColumnNames, tablePath)
+partition.copy(storage = partition.storage.copy(locationUri = Some(rightPath.toString)))
+  }
+  alterPartitions(db, table, newParts)
+}
+  }
+
+  /**
+   * Rename the partition directory w.r.t. the actual partition columns.
+   *
+   * It will recursively rename the partition directory from the first partition column, to be most
+   * compatible with different file systems. e.g. in some file systems, renaming `a=1/b=2` to
+   * `A=1/B=2` will result to `a=1/B=2`, while in some other file systems, the renaming works, but
+   * will leave an empty directory `a=1`.
+   */
+  private def renamePartitionDirectory(
+  fs: FileSystem,
+  tablePath: Path,
+  partCols: Seq[String],
+  newSpec: TablePartitionSpec): Path = {
+import ExternalCatalogUtils.getPartitionPathString
+
+var expectedPartitionPath = tablePath
+var actualPartitionPath = tablePath
+partCols.foreach { col =>
+  val partitionValue = newSpec(col)
+  val expectedPartitionPathString = getPartitionPathString(col, partitionValue)
+  val actualPartitionPathString = getPartitionPathString(col.toLowerCase, partitionValue)
+
+  expectedPartitionPath = new Path(expectedPartitionPath, expectedPartitionPathString)
+  actualPartitionPath = new Path(actualPartitionPath, actualPartitionPathString)
--- End diff --

Since we renamed the previous `actualPartitionPath` to `expectedPartitionPath` in the last iteration, `actualPartitionPath` doesn't exist anymore.





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r99981498
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -892,21 +892,61 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
 if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
   val tablePath = new Path(tableMeta.location)
+  val fs = tablePath.getFileSystem(hadoopConf)
   val newParts = newSpecs.map { spec =>
+val rightPath = renamePartitionDirectory(fs, tablePath, partitionColumnNames, spec)
 val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
-val wrongPath = new Path(partition.location)
-val rightPath = ExternalCatalogUtils.generatePartitionPath(
-  spec, partitionColumnNames, tablePath)
+partition.copy(storage = partition.storage.copy(locationUri = Some(rightPath.toString)))
+  }
+  alterPartitions(db, table, newParts)
+}
+  }
+
+  /**
+   * Rename the partition directory w.r.t. the actual partition columns.
+   *
+   * It will recursively rename the partition directory from the first partition column, to be most
+   * compatible with different file systems. e.g. in some file systems, renaming `a=1/b=2` to
+   * `A=1/B=2` will result to `a=1/B=2`, while in some other file systems, the renaming works, but
+   * will leave an empty directory `a=1`.
+   */
+  private def renamePartitionDirectory(
+  fs: FileSystem,
+  tablePath: Path,
+  partCols: Seq[String],
+  newSpec: TablePartitionSpec): Path = {
+import ExternalCatalogUtils._
+
+var totalPath = tablePath
+partCols.foreach { col =>
+  val partitionValue = newSpec(col)
+  val partitionString = if (partitionValue == null) {
+DEFAULT_PARTITION_NAME
+  } else {
+escapePathName(partitionValue)
+  }
+
+  val nextPartPath = new Path(totalPath, escapePathName(col) + "=" + partitionString)
+  if (fs.exists(nextPartPath)) {
+// It is possible that some parental partition directories already exist or doesn't need to
+// be renamed. e.g. the partition columns are `a` and `B`, then we don't need to rename
+// `/table_path/a=1`. Or we already have a partition directory `A=1/B=2`, and we rename
+// another partition to `A=1/B=3`, then we will have `A=1/B=2` and `a=1/b=3`, and we should
+// just move `a=1/b=3` into `A=1` with new name `B=3`.
+  } else {
+val wrongPartPath = new Path(
--- End diff --

good catch!





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r99980560
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -892,21 +892,61 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
 val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
 if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
   val tablePath = new Path(tableMeta.location)
+  val fs = tablePath.getFileSystem(hadoopConf)
   val newParts = newSpecs.map { spec =>
+val rightPath = renamePartitionDirectory(fs, tablePath, partitionColumnNames, spec)
 val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
-val wrongPath = new Path(partition.location)
-val rightPath = ExternalCatalogUtils.generatePartitionPath(
-  spec, partitionColumnNames, tablePath)
+partition.copy(storage = partition.storage.copy(locationUri = Some(rightPath.toString)))
+  }
+  alterPartitions(db, table, newParts)
+}
+  }
+
+  /**
+   * Rename the partition directory w.r.t. the actual partition columns.
+   *
+   * It will recursively rename the partition directory from the first partition column, to be most
+   * compatible with different file systems. e.g. in some file systems, renaming `a=1/b=2` to
+   * `A=1/B=2` will result to `a=1/B=2`, while in some other file systems, the renaming works, but
+   * will leave an empty directory `a=1`.
+   */
+  private def renamePartitionDirectory(
+  fs: FileSystem,
+  tablePath: Path,
+  partCols: Seq[String],
+  newSpec: TablePartitionSpec): Path = {
+import ExternalCatalogUtils._
+
+var totalPath = tablePath
+partCols.foreach { col =>
+  val partitionValue = newSpec(col)
+  val partitionString = if (partitionValue == null) {
+DEFAULT_PARTITION_NAME
+  } else {
+escapePathName(partitionValue)
+  }
+
+  val nextPartPath = new Path(totalPath, escapePathName(col) + "=" + partitionString)
+  if (fs.exists(nextPartPath)) {
+// It is possible that some parental partition directories already exist or doesn't need to
+// be renamed. e.g. the partition columns are `a` and `B`, then we don't need to rename
+// `/table_path/a=1`. Or we already have a partition directory `A=1/B=2`, and we rename
+// another partition to `A=1/B=3`, then we will have `A=1/B=2` and `a=1/b=3`, and we should
+// just move `a=1/b=3` into `A=1` with new name `B=3`.
+  } else {
+val wrongPartPath = new Path(
--- End diff --

The base path of `wrongPartPath` needs to be updated with the column from the previous iteration.
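
In other words (a hedged sketch of the fix this asks for; the construction of `wrongPartPath` is cut off in the quoted diff, so this fragment reuses its names and is an assumed reconstruction rather than the actual change):

    // `totalPath` already contains the corrected components from the previous
    // iterations, so the mis-cased child must be resolved against it, not
    // against a prefix built entirely from lower-cased column names.
    val wrongPartPath = new Path(totalPath, escapePathName(col.toLowerCase) + "=" + partitionString)
    if (fs.exists(wrongPartPath)) {
      fs.rename(wrongPartPath, nextPartPath)
    }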





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-07 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/16837

[SPARK-19359][SQL] renaming partition should not leave useless directories

## What changes were proposed in this pull request?

When we rename a nested path, different file systems behave differently. E.g. on Jenkins, renaming `a=1/b=2` to `A=1/B=2` succeeds but leaves an empty directory `a=1`; on macOS, the renaming fails and results in `a=1/B=2`.

This PR renames the partition path recursively, starting from the first partition column, to be maximally compatible with different file systems.
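
For concreteness, a minimal version of the scenario (hedged; the table and column names are just examples, mirroring the new regression test):

    // A managed table with upper-case partition columns; mirrors the new test.
    Seq((1, 2, 3)).toDF("id", "A", "B").write.partitionBy("A", "B").saveAsTable("t")
    spark.sql("alter table t partition(A=2, B=3) rename to partition(A=4, B=5)")
    // Behind the scenes the renamed partition directory has to be moved from its
    // lower-cased form (e.g. a=4/b=5) to the expected A=4/B=5. Renaming that
    // nested path in one step behaves differently across file systems (empty
    // leftover parent vs. partial rename); this patch renames one level at a time.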

## How was this patch tested?

new regression test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark partition

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16837.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16837


commit 47254448c60cbf92921a8140cf0f513a3329b0ff
Author: Wenchen Fan 
Date:   2017-02-07T17:28:23Z

renaming partition should not leave useless directories



