[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-17 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456232227



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,46 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+    logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+    val path = new Path(mrScratchDir, "-mr-1")
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (scheme.equals("file")) {
+      logWarning(s"Temporary data will be written into a local file system " +
+        s"(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +
+        s"you might need to configure 'hive.exec.scratchdir' " +
+        s"to use an accessible file system (e.g. an HDFS path) from any executors in the cluster.")

Review comment:
   Removed the unnecessary `s` at the head.
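For context, the scheme check in the hunk above (`Option(path.toUri.getScheme).getOrElse("")`) can be sketched outside Spark, using `java.net.URI` as a stand-in for Hadoop's `Path` (an assumption for illustration; `Path#toUri` behaves the same way for these well-formed paths):

```scala
import java.net.URI

// Approximation of the diff's scheme extraction, with java.net.URI
// standing in for org.apache.hadoop.fs.Path.
def schemeOf(path: String): String =
  Option(new URI(path).getScheme).getOrElse("")

// A "file" scheme marks a local file system, which is what triggers the warning.
assert(schemeOf("hdfs://namenode:8020/tmp/hive") == "hdfs")
assert(schemeOf("file:///tmp/hive") == "file")
assert(schemeOf("/tmp/hive") == "")  // no scheme: the default file system applies
```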





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-17 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456294954



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,7 +99,42 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+    logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+    val path = new Path(mrScratchDir, "-mr-1")
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (scheme.equals("file")) {
+      logWarning("Temporary data will be written into a local file system " +
+        "(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +

Review comment:
   Oh it was my bad. Reverted only the necessary part.
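For background on the fix being discussed: without Scala's `s` interpolator, `$scheme` and `$mrScratchDir` are kept verbatim in the string rather than substituted, so the log message would print the literal placeholder names:

```scala
val scheme = "file"
// With the s interpolator, $scheme is substituted at runtime.
val interpolated = s"scheme: '$scheme'"
// Without it, the literal characters "$scheme" end up in the log message.
val literal = "scheme: '$scheme'"

assert(interpolated == "scheme: 'file'")
assert(literal == "scheme: '$scheme'")
```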








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-16 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456232227



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,46 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+    logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+    val path = new Path(mrScratchDir, "-mr-1")
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (scheme.equals("file")) {
+      logWarning(s"Temporary data will be written into a local file system " +
+        s"(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +
+        s"you might need to configure 'hive.exec.scratchdir' " +
+        s"to use an accessible file system (e.g. an HDFS path) from any executors in the cluster.")

Review comment:
   Removed the `s` at the head. BTW, there is a lot of existing code that includes it, but I left it as-is.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-07-16 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r456231877



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,46 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+    logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+    val path = new Path(mrScratchDir, "-mr-1")
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (scheme.equals("file")) {
+      logWarning(s"Temporary data will be written into a local file system " +
+        s"(scheme: '$scheme', path: '$mrScratchDir'). If your Spark is not in local mode, " +
+        s"you might need to configure 'hive.exec.scratchdir' " +
+        s"to use an accessible file system (e.g. an HDFS path) from any executors in the cluster.")
+    }
+    path
+  }
+
+  private def supportSchemeToUseNonBlobStore(path: Path): Boolean = {
+    path != null && {
+      val supportedBlobSchemes = SQLConf.get.supportedSchemesToUseNonBlobstore
+      val scheme = Option(path.toUri.getScheme).getOrElse("")
+      Utils.stringToSeq(supportedBlobSchemes).contains(scheme.toLowerCase(Locale.ROOT))
+    }
+  }
+
+  def getExternalTmpPath(
   sparkSession: SparkSession,
   hadoopConf: Configuration,
   path: Path): Path = {
 import org.apache.spark.sql.hive.client.hive._
-

Review comment:
   Thanks for pointing it out. Reverted.
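The `supportSchemeToUseNonBlobStore` helper in the hunk above boils down to a comma-separated scheme allowlist. A self-contained sketch, with `java.net.URI` in place of Hadoop's `Path`, a hand-rolled approximation of `Utils.stringToSeq`, and a made-up allowlist string rather than the real SQLConf default:

```scala
import java.net.URI
import java.util.Locale

// Approximation of Utils.stringToSeq: split a comma-separated config value.
def stringToSeq(s: String): Seq[String] =
  s.split(",").map(_.trim).filter(_.nonEmpty).toSeq

// True when the path is non-null and its scheme is in the allowlist,
// mirroring supportSchemeToUseNonBlobStore.
def supportScheme(path: String, supportedSchemes: String): Boolean =
  path != null && {
    val scheme = Option(new URI(path).getScheme).getOrElse("")
    stringToSeq(supportedSchemes).contains(scheme.toLowerCase(Locale.ROOT))
  }

// "s3,s3a,s3n" is an illustrative allowlist, not the actual default.
assert(supportScheme("s3a://bucket/warehouse/t", "s3,s3a,s3n"))
assert(!supportScheme("hdfs://nn:8020/warehouse/t", "s3,s3a,s3n"))
```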








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-28 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446739501



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,46 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)
+    logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+    val path = new Path(mrScratchDir, "-mr-1")
+    val scheme = Option(path.toUri.getScheme).getOrElse("")
+    if (scheme == "file") {
+      logWarning(s"Temporary data will be written into a local disk " +

Review comment:
   If Spark is running in local mode, or if Spark uses NFS, it works even if `file` is used.
   If we throw an exception here, we cannot cover those use cases.
   
   IMO, it is similar to the fact that Spark can still use RawLocalFileSystem (file://).
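The path selection under discussion is a two-level fallback plus Hive's `-mr-1` suffix. A string-based sketch (the additional components that `oldVersionExternalTempPath` contributes to the real path are omitted here, and plain strings stand in for Hadoop's `Path`):

```scala
// String-based sketch of getNonBlobTmpPath's directory selection; the extra
// components added by oldVersionExternalTempPath are intentionally omitted.
def nonBlobTmpDir(sessionScratchDir: String, scratchDir: String): String = {
  // Prefer the per-session scratch dir; fall back to the global one.
  val sessionPath = if (sessionScratchDir.nonEmpty) sessionScratchDir else scratchDir
  s"$sessionPath/-mr-1"  // Hive's conventional MR temp subdirectory name
}

assert(nonBlobTmpDir("hdfs://nn/tmp/hive/alice/sess-1", "/tmp/hive")
  == "hdfs://nn/tmp/hive/alice/sess-1/-mr-1")
assert(nonBlobTmpDir("", "/tmp/hive") == "/tmp/hive/-mr-1")
```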








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-28 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446738298



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +162,25 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
 val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir' are used")
 
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-  newVersionExternalTempPath(path, hadoopConf, stagingDir)
+  // HIVE-14270: Write temporary data to HDFS when doing inserts on tables located on S3

Review comment:
   Makes sense. Modified to use `SPARK-21514` in this comment.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-28 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446614786



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   It makes sense to me. Thank you for the suggestion.
   I added an error message to suggest that end-users configure `hive.exec.scratchdir` if the scheme of the temporary directory is `file`.
   Can you check if it is clear enough for end-users?
   
   ```
   val path = new Path(mrScratchDir, "-mr-1")
   val scheme = Option(path.toUri.getScheme).getOrElse("")
   if (scheme == "file") {
     logWarning(s"Temporary data will be written into a local disk " +
       s"(scheme: '$scheme', path: '$mrScratchDir'). " +
       s"You need to configure 'hive.exec.scratchdir' to use an accessible location " +
       s"(e.g. an HDFS path) from any executors in the cluster.")
   }
   ```
   
   Note: I also added one test case for HDFS.












[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-27 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446516892



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   Hive has two kinds of scratch dirs: one local, the other on HDFS.
   https://mingyue.me/2018/11/17/hive-scratch-working-directory/
   
   In this pull request, the latter, `hive.exec.scratchdir`, is used. In my understanding, we can assume an HDFS scheme in most cases.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-27 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446604034



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   Currently we just rely on `hive.exec.scratchdir` (not directly on `fs.default.name`), and it works in most use cases even if `hive.exec.scratchdir` is not configured explicitly.
   I do not want to restrict this feature to HDFS only, because I have seen some clusters that do not have HDFS. I want to let end-users choose any scheme where they want to store temporary data.











[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-27 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446588070



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+      hadoopConf: Configuration,
+      sessionScratchDir: String,
+      scratchDir: String): Path = {
+
+    // Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+    // which is governed by 'hive.exec.scratchdir', including the file system.
+    // This is the same as Spark's #oldVersionExternalTempPath.
+    // The only difference between #oldVersionExternalTempPath and Hive 2.3.0's is HIVE-7090.
+    // HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+    // Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+    val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+    val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   Here, no specific scheme is configured in `hive.exec.scratchdir` as the default value; the default is just `/tmp/hive`, without a scheme. That means the default scheme determined by `fs.default.name` is used.
   If `fs.default.name` is a local scheme like `file://`, then `file://` will be used. If `fs.default.name` is an HDFS scheme like `hdfs://host:port/`, then `hdfs://host:port/` will be used. I confirmed this behavior on my end.
   
   In most cases, `fs.default.name` is an HDFS scheme.
   Even if a local scheme is used here, it should work, because we can assume a local scheme is usable in such an environment (e.g. unit testing).
   
   Of course, users have full control over `hive.exec.scratchdir` and `fs.default.name`, so I believe it won't be a problem.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-27 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446588070



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath.
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090.
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+// Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   Here, no specific scheme is configured in `hive.exec.scratchdir` (the default is just `/tmp/hive`, without a scheme). This means the default scheme determined by `fs.default.name` is used.
   If `fs.default.name` uses a local scheme like `file://`, then `file://` will be used; if it uses an HDFS scheme like `hdfs://host:port/`, then `hdfs://host:port/` will be used. I confirmed this behavior on my end to make sure.
   
   In most cases, `fs.default.name` points to an HDFS scheme.
   Even if a local scheme is used here, it should work, because we can assume a local scheme is usable in such an environment (e.g. unit testing).








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-27 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446516892



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,12 +99,38 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getNonBlobTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath.
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090.
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+// Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   Hive has two kinds of scratch dirs accordingly: one local, the other on HDFS.
   https://mingyue.me/2018/11/17/hive-scratch-working-directory/
   
   In this pull request, the latter one, `hive.exec.scratchdir`, is used. In my understanding, we can assume an HDFS scheme here.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446482526



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val hiveVersion = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir' are used")
 
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-  newVersionExternalTempPath(path, hadoopConf, stagingDir)
+  // HIVE-14270: Write temporary data to HDFS when doing inserts on tables 
located on S3
+  // Copied from Context.java#getTempDirForPath of Hive 2.3.
+  if (supportSchemeToUseNonBlobStore(path)) {
+// Hive sets session_path as 
HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+val HDFS_SESSION_PATH_KEY = "_hive.hdfs.session.path"
+val sessionScratchDir = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+  .client.getConf(HDFS_SESSION_PATH_KEY, "")
+logDebug(s"session scratch dir '$sessionScratchDir' is used")
+getMRTmpPath(hadoopConf, sessionScratchDir, scratchDir)

Review comment:
   Yes, Hive's `scratchDir` is still used in Hive 2.0 or later.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446482475



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val hiveVersion = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir' are used")
 
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-  newVersionExternalTempPath(path, hadoopConf, stagingDir)
+  // HIVE-14270: Write temporary data to HDFS when doing inserts on tables 
located on S3

Review comment:
   Added version check using `hiveVersionsSupportingNonBlobstoreTempPath`.
   
   ```
   val hiveVersionsSupportingNonBlobstoreTempPath: Set[HiveVersion] =
 Set(v2_0, v2_1, v2_2, v2_3, v3_0, v3_1)
   ```








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-26 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r446482392



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,7 +99,34 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getMRTmpPath(

Review comment:
   Renamed this method from `getMRTmpPath` to `getNonBlobTmpPath`.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-16 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r441212073



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -839,6 +839,17 @@ object SQLConf {
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
 .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
 
+  val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE =
+buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore")
+  .doc("Comma-separated list of supported blobstore schemes (e.g. 
's3,s3a'). " +
+"If any blobstore schemes are specified, this feature is enabled. " +
+"When writing data out to a Hive table, " +
+"Spark writes the data first into non blobstore storage, and then 
moves it to blobstore. " +
+"By default, this option is set to empty. It means this feature is 
disabled.")
+  .version("3.1.0")
+  .stringConf
+  .createWithDefault("")

Review comment:
   If all the blob storage services had the same characteristics, that would be possible. However, we actually cannot guarantee it.
   I believe it is better to allow users to enable/disable this feature per scheme.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r439917190



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val hiveVersion = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir' are used")
 
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {

Review comment:
   Got it. I added the description "This option is supported in Hive 2.0 or 
later." in SQLConf.scala.
   
https://github.com/apache/spark/pull/27690/files#diff-9a6b543db706f1a90f790783d6930a13R849








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r439913882



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -839,6 +839,17 @@ object SQLConf {
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
 .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
 
+  val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE =
+buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore")
+  .doc("Comma-separated list of supported blobstore schemes (e.g. 
's3,s3a'). " +
+"If any blobstore schemes are specified, this feature is enabled. " +
+"When writing data out to a Hive table, " +
+"Spark writes the data first into non blobstore storage, and then 
moves it to blobstore. " +
+"By default, this option is set to empty. It means this feature is 
disabled.")
+  .version("3.1.0")
+  .stringConf
+  .createWithDefault("")

Review comment:
   Note: I am not 100% sure whether all these blob storage systems have similar characteristics, nor whether this option is effective for all of them. At least, this option is effective for Amazon S3.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-14 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r439913383



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -839,6 +839,17 @@ object SQLConf {
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
 .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
 
+  val HIVE_SUPPORTED_SCHEMES_TO_USE_NONBLOBSTORE =
+buildConf("spark.sql.hive.supportedSchemesToUseNonBlobstore")
+  .doc("Comma-separated list of supported blobstore schemes (e.g. 
's3,s3a'). " +
+"If any blobstore schemes are specified, this feature is enabled. " +
+"When writing data out to a Hive table, " +
+"Spark writes the data first into non blobstore storage, and then 
moves it to blobstore. " +
+"By default, this option is set to empty. It means this feature is 
disabled.")
+  .version("3.1.0")
+  .stringConf
+  .createWithDefault("")

Review comment:
   Users can specify any blob storage scheme like the following. If the copy operation is expensive in the storage system, this option will be effective.
   - Amazon S3: `s3`, `s3a`, `s3n`
   - Azure Blob Storage: `wasb`, `wasbs`
   - Google Cloud Storage: `gs`
   - Databricks: `dbfs`
   - OpenStack: `swift`
   
   Since any scheme can be used, I believe we cannot define a fixed set of supported schemes here. That's why I just listed samples in SQLConf.scala.
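
As a rough sketch of how such a comma-separated setting could gate the feature (illustrative only; `isBlobstoreScheme` is a hypothetical helper, not the PR's actual code):

```scala
import java.net.URI
import java.util.Locale

// Parse the comma-separated scheme list and test a write target against it.
// An empty setting yields an empty set, i.e. the feature stays disabled.
def isBlobstoreScheme(supportedSchemes: String, path: String): Boolean = {
  val schemes = supportedSchemes.split(",")
    .map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSet
  val scheme = Option(new URI(path).getScheme)
    .getOrElse("").toLowerCase(Locale.ROOT)
  schemes.contains(scheme)
}

println(isBlobstoreScheme("s3,s3a", "s3a://bucket/warehouse/t"))  // true
println(isBlobstoreScheme("s3,s3a", "hdfs://nn/warehouse/t"))     // false
println(isBlobstoreScheme("", "s3a://bucket/warehouse/t"))        // false
```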








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-12 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r439706256



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -97,7 +99,34 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  // Mostly copied from Context.java#getMRTmpPath of Hive 2.3.
+  // Visible for testing.
+  private[execution] def getMRTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath.
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090.
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'.
+// Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir.
+val sessionPath = if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), hadoopConf, sessionPath)

Review comment:
   Several HDFS scratch directories are created when SessionState starts.
   If the session scratch directory is created at the path specified in `_hive.hdfs.session.path`, that directory should be used. If it is not specified, then we just use scratchDir.
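
That fallback order can be condensed into a one-liner (a sketch of the selection logic only, mirroring the quoted diff; `resolveScratchPath` is an illustrative name):

```scala
// Prefer the HDFS session scratch dir (from '_hive.hdfs.session.path');
// fall back to 'hive.exec.scratchdir' when no session dir was created.
def resolveScratchPath(sessionScratchDir: String, scratchDir: String): String =
  if (sessionScratchDir.nonEmpty) sessionScratchDir else scratchDir

println(resolveScratchPath("/tmp/hive/user1/sess1", "/tmp/hive")) // /tmp/hive/user1/sess1
println(resolveScratchPath("", "/tmp/hive"))                      // /tmp/hive
```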








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-06-12 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r439705159



##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
##
@@ -124,11 +153,24 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val hiveVersion = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir' are used")
 
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {

Review comment:
   It is because old versions do not support data copy between different types of DFSs.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-04-19 Thread GitBox


moomindani commented on a change in pull request #27690:
URL: https://github.com/apache/spark/pull/27690#discussion_r411040014



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -751,6 +751,22 @@ object SQLConf {
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
 .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
 
+  val HIVE_BLOBSTORE_SUPPORTED_SCHEMES =
+buildConf("spark.sql.hive.blobstore.supported.schemes")
+  .doc("Comma-separated list of supported blobstore schemes.")

Review comment:
   Added the lines into the new configurations.








[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396830941
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##
 @@ -751,6 +751,20 @@ object SQLConf {
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
 .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
 
+  val HIVE_BLOBSTORE_SUPPORTED_SCHEMES =
+buildConf("spark.sql.hive.blobstore.supported.schemes")
+  .doc("Comma-separated list of supported blobstore schemes.")
 
 Review comment:
   Modified.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396830905
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
+   *
+   */
+  def getMRTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+// Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir
+val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)
 
 Review comment:
   Modified.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396829010
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
+   *
+   */
+  def getMRTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+// Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir
+val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)
+logDebug(s"session path '${sessionPath.toString}' is used")
+
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), 
hadoopConf, sessionPath)
+logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+new Path(mrScratchDir, "-mr-1")
+  }
+
+  private def isBlobStoragePath(path: Path): Boolean = {
+path != null && 
isBlobStorageScheme(Option(path.toUri.getScheme).getOrElse(""))
 
 Review comment:
   I believe that there are only two cases where S3 is used as scratch dir even 
for S3 path with `spark.sql.hive.blobstore.use.blobstore.as.scratchdir=false`.
   - When `_hive.hdfs.session.path` is set to `s3`
   - When `_hive.hdfs.session.path` is not set and `hive.exec.scratchdir` is 
set to `s3`
   
   In both cases, these params can be configured by users, so it won't be an 
issue.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396799099
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
+   *
+   */
+  def getMRTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+// Here it uses sessionScratchDir unless it's empty; otherwise it uses scratchDir
+val sessionPath = Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)
+logDebug(s"session path '${sessionPath.toString}' is used")
+
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), 
hadoopConf, sessionPath)
+logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+new Path(mrScratchDir, "-mr-1")
+  }
+
+  private def isBlobStoragePath(path: Path): Boolean = {
+path != null && 
isBlobStorageScheme(Option(path.toUri.getScheme).getOrElse(""))
 
 Review comment:
   As far as I tried, it worked even when `fs.default.name` is set to `s3`.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396799099
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
+   *
+   */
+  def getMRTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+// Here it uses the session path unless it's empty; otherwise it falls back to scratchDir
+val sessionPath = 
Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)
+logDebug(s"session path '${sessionPath.toString}' is used")
+
+val mrScratchDir = oldVersionExternalTempPath(new Path(sessionPath), 
hadoopConf, sessionPath)
+logDebug(s"MR scratch dir '$mrScratchDir/-mr-1' is used")
+new Path(mrScratchDir, "-mr-1")
+  }
+
+  private def isBlobStoragePath(path: Path): Boolean = {
+path != null && 
isBlobStorageScheme(Option(path.toUri.getScheme).getOrElse(""))
 
 Review comment:
   As far as I tried, it worked even when `fs.default.name` is set to `s3`.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396289647
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +99,43 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
+   *
+   */
+  def getMRTmpPath(
+  hadoopConf: Configuration,
+  sessionScratchDir: String,
+  scratchDir: String): Path = {
+
+// Hive's getMRTmpPath uses nonLocalScratchPath + '-mr-1',
+// which is ruled by 'hive.exec.scratchdir' including file system.
+// This is the same as Spark's #oldVersionExternalTempPath
+// Only difference between #oldVersionExternalTempPath and Hive 2.3.0's is 
HIVE-7090
+// HIVE-7090 added user_name/session_id on top of 'hive.exec.scratchdir'
+// Here it uses the session path unless it's empty; otherwise it falls back to scratchDir
+val sessionPath = 
Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)
 
 Review comment:
   Thank you for the comment. As you said, `Option` is not needed here.
   I will re-write this with `val sessionPath = if (!sessionScratchDir.isEmpty) 
sessionScratchDir else scratchDir`.
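The two fallback forms discussed above are equivalent for the values this code can see; a quick sketch (the directory strings are just plain values here, not read from Hive config):

```scala
// Form used in the diff: Option-based fallback.
def viaOption(sessionScratchDir: String, scratchDir: String): String =
  Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)

// Simplified form suggested in the review: plain if/else.
def viaIf(sessionScratchDir: String, scratchDir: String): String =
  if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir

// Both pick the session dir when it is set ...
assert(viaOption("/tmp/hive/user_a/session_b", "/tmp/hive_scratch") ==
  viaIf("/tmp/hive/user_a/session_b", "/tmp/hive_scratch"))
// ... and fall back to the scratch dir when it is empty.
assert(viaOption("", "/tmp/hive_scratch") == "/tmp/hive_scratch")
assert(viaIf("", "/tmp/hive_scratch") == "/tmp/hive_scratch")
```

The one behavioral difference is `null`: the `Option` form falls back, while the `if` form would throw. The simplification is safe here because `getConf("_hive.hdfs.session.path", "")` always returns at least the empty-string default.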





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396285657
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -125,10 +163,22 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
 
+// Hive sets session_path as 
HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+val sessionScratchDir = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+  .client.getConf("_hive.hdfs.session.path", "")
 
 Review comment:
  As you mentioned in another comment, an empty string is retrieved when `_hive.hdfs.session.path` is not set.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-23 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396284464
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##
 @@ -751,6 +751,20 @@ object SQLConf {
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
 .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
 
+  val HIVE_BLOBSTORE_SUPPORTED_SCHEMES =
+buildConf("spark.sql.hive.blobstore.supported.schemes")
+  .doc("Comma-separated list of supported blobstore schemes.")
 
 Review comment:
  Thank you for the comment. The current explanation is the same sentence as Hive's. Based on your comment, I will add an explanation like `If you disable this parameter, Spark writes the data first to the scratch dir and then moves it to the blobstore, because moving data within the blobstore is expensive.`





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-22 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396179637
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -125,10 +163,22 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
 
+// Hive sets session_path as 
HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+val sessionScratchDir = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+  .client.getConf("_hive.hdfs.session.path", "")
 
 Review comment:
   It is tested in this unit test.
   
https://github.com/apache/spark/pull/27690/files#diff-ee422d26750ba346c81b7f85b4b14577R46





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-22 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r396178891
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -125,10 +163,22 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
 
+// Hive sets session_path as 
HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+val sessionScratchDir = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+  .client.getConf("_hive.hdfs.session.path", "")
 
 Review comment:
   If `_hive.hdfs.session.path` is empty, `getMRTmpPath` uses `scratchDir` 
instead of `sessionScratchDir` in this line:  `val sessionPath = 
Option(sessionScratchDir).filterNot(_.isEmpty).getOrElse(scratchDir)`.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-03 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r387429262
 
 

 ##
 File path: sql/hive/pom.xml
 ##
 @@ -189,6 +189,11 @@
   scalacheck_${scala.binary.version}
   test
 
+
+  org.mockito
+  mockito-core
+  test
 
 Review comment:
   Thanks, I removed it.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-01 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r386197390
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -125,10 +162,23 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
 val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
 val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
 
+// Hive sets session_path as 
HDFS_SESSION_PATH_KEY(_hive.hdfs.session.path) in hive config
+val sessionScratchDir = 
externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
+  .client.getConf("_hive.hdfs.session.path", "")
+logDebug(s"path '${path.toString}', staging dir '$stagingDir', " +
+  s"scratch dir '$scratchDir', session scratch dir '$sessionScratchDir' 
are used")
+
 if (hiveVersionsUsingOldExternalTempPath.contains(hiveVersion)) {
   oldVersionExternalTempPath(path, hadoopConf, scratchDir)
 } else if (hiveVersionsUsingNewExternalTempPath.contains(hiveVersion)) {
-  newVersionExternalTempPath(path, hadoopConf, stagingDir)
+  // HIVE-14270: Write temporary data to HDFS when doing inserts on tables 
located on S3
+  // Copied from Context.java#getTempDirForPath of Hive 2.3
+  if (isBlobStoragePath(hadoopConf, path)
+&& !useBlobStorageAsScratchDir(hadoopConf)) {
+getMRTmpPath(hadoopConf, sessionScratchDir, scratchDir)
 
 Review comment:
  Although I used `spark.hadoop.hive.blobstore.*` in the initial commit, that was only for consistency with Hive.
  Since I do not have any technical reason to keep it, I changed it to use SQLConf to follow Spark's naming rule here.
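The branch being discussed reduces to a small decision function. A sketch under the assumption that the supported-schemes list and the scratch-dir flag are plain parameters rather than SQLConf lookups:

```scala
import java.util.Locale

// Decide whether temporary data should go to a non-blobstore scratch dir:
// only when the write target is on a blobstore AND using the blobstore as
// scratch space is disabled.
def useNonBlobScratchDir(
    targetScheme: String,
    blobSchemes: Set[String],
    useBlobAsScratchDir: Boolean): Boolean =
  blobSchemes.contains(targetScheme.toLowerCase(Locale.ROOT)) &&
    !useBlobAsScratchDir

val blobSchemes = Set("s3", "s3a", "s3n")

// Writing to S3 with the option disabled -> stage on the HDFS scratch dir.
assert(useNonBlobScratchDir("s3a", blobSchemes, useBlobAsScratchDir = false))
// Writing to HDFS, or explicitly allowing blobstore scratch, keeps the
// original staging behavior.
assert(!useNonBlobScratchDir("hdfs", blobSchemes, useBlobAsScratchDir = false))
assert(!useNonBlobScratchDir("s3", blobSchemes, useBlobAsScratchDir = true))
```

This mirrors the `isBlobStoragePath(...) && !useBlobStorageAsScratchDir(...)` condition in the diff; the function name and signature here are illustrative only.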





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-03-01 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r386185658
 
 

 ##
 File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFileSuite.scala
 ##
 @@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.scalatestplus.mockito.MockitoSugar
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class SaveAsHiveFileSuite extends QueryTest with TestHiveSingleton with 
MockitoSugar {
+  test("sessionScratchDir = '/tmp/hive/user_a/session_b' & scratchDir = 
'/tmp/hive_scratch'") {
+val insertIntoHiveTable = new InsertIntoHiveTable(
+  mock[CatalogTable], Map.empty[String, Option[String]],
+  mock[LogicalPlan], true, false, Seq.empty[String])
 
 Review comment:
  It makes sense to me. Thanks, I did not notice it.
  I removed mockito and just used `null` per the link.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-29 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r386026628
 
 

 ##
 File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFileSuite.scala
 ##
 @@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.scalatest.GivenWhenThen
+import org.scalatestplus.mockito.MockitoSugar
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class SaveAsHiveFileSuite extends QueryTest with TestHiveSingleton
+with MockitoSugar with GivenWhenThen {
 
 Review comment:
  In order to test this patch's exact behavior, we would need to check that the intermediate files are located in HDFS, not in the blobstore. However, that is not easy to test, which is why I am checking each path here instead.
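Checking the computed path (instead of observing real HDFS writes) can be sketched like this; the `-mr-1` suffix and directory values mirror the diff, but the join is done with plain strings rather than Hadoop's `Path`:

```scala
// Compute the MR tmp path the way getMRTmpPath does: pick the session
// scratch dir when present, fall back to the global scratch dir, then
// append the "-mr-1" component used in the diff.
def mrTmpPath(sessionScratchDir: String, scratchDir: String): String = {
  val sessionPath =
    if (!sessionScratchDir.isEmpty) sessionScratchDir else scratchDir
  s"$sessionPath/-mr-1"
}

// With a session scratch dir, the tmp path lands under it ...
assert(mrTmpPath("/tmp/hive/user_a/session_b", "/tmp/hive_scratch") ==
  "/tmp/hive/user_a/session_b/-mr-1")
// ... and without one, it lands under the global scratch dir.
assert(mrTmpPath("", "/tmp/hive_scratch") == "/tmp/hive_scratch/-mr-1")
```

Asserting on the string form of the path is exactly the kind of check the unit test in the diff performs, without needing a running HDFS or blobstore.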





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-29 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r386026502
 
 

 ##
 File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFileSuite.scala
 ##
 @@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.scalatest.GivenWhenThen
+import org.scalatestplus.mockito.MockitoSugar
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+
+class SaveAsHiveFileSuite extends QueryTest with TestHiveSingleton
+with MockitoSugar with GivenWhenThen {
+  test("getMRTmpPath method") {
+val insertIntoHiveTable = new InsertIntoHiveTable(
+  mock[CatalogTable], Map.empty[String, Option[String]],
+  mock[LogicalPlan], true, false, Seq.empty[String])
+val hadoopConf = new Configuration()
+val scratchDir = "/tmp/hive_scratch"
+val sessionScratchDir = "/tmp/hive/user_a/session_b"
+
+Given(s"sessionScratchDir = '$sessionScratchDir' & scratchDir = 
'$scratchDir'")
+When("get the path from getMRTmpPath(hadoopConf, sessionScratchDir, 
scratchDir)")
 
 Review comment:
  I see; I changed it to use a simple `assert` instead of `GivenWhenThen`.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-29 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r386024959
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala
 ##
 @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
 
 Review comment:
   I moved the functions to private functions in `SaveAsHiveFile`.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-29 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r386023688
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends 
DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
 
 Review comment:
   I checked `hive/Context.java` in the master branch, and I did not see much 
difference.
   - master: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/Context.java
   - release 2.3.0: 
https://github.com/apache/hive/blob/rel/release-2.3.0/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L588





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-27 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r385497093
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala
 ##
 @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+  hadoopConf: Configuration,
+  path: Path): Boolean = {
+path != null && isBlobStorageScheme(hadoopConf, 
Option(path.toUri.getScheme).getOrElse(""))
+  }
+
+  def isBlobStorageScheme(
+  hadoopConf: Configuration,
+  scheme: String): Boolean = {
 
 Review comment:
   Removed unneeded line breaks.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-27 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r385497038
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala
 ##
 @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+  hadoopConf: Configuration,
+  path: Path): Boolean = {
 
 Review comment:
   Removed unneeded line breaks.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-27 Thread GitBox
moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] 
Added a new option to use non-blobstore storage when writing into blobstore 
storage
URL: https://github.com/apache/spark/pull/27690#discussion_r385496980
 
 

 ##
 File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala
 ##
 @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+  hadoopConf: Configuration,
+  path: Path): Boolean = {
+path != null && isBlobStorageScheme(hadoopConf, 
Option(path.toUri.getScheme).getOrElse(""))
+  }
+
+  def isBlobStorageScheme(
+  hadoopConf: Configuration,
+  scheme: String): Boolean = {
+val supportedBlobSchemes = 
hadoopConf.get("hive.blobstore.supported.schemes", "s3,s3a,s3n")
+supportedBlobSchemes.toLowerCase(Locale.ROOT)
+  .split(",")
+  .map(_.trim)
+  .contains(scheme.toLowerCase(Locale.ROOT))
 
 Review comment:
   I replaced this with `Utils.stringToSeq`.
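The parsing being replaced and a `Utils.stringToSeq`-style form are equivalent for this input shape; a sketch (the helper below is a stand-in with the same split/trim/filter behavior, not Spark's actual `Utils.stringToSeq`):

```scala
import java.util.Locale

// Stand-in for Spark's Utils.stringToSeq: split on commas, trim entries,
// and drop any empty pieces.
def stringToSeq(str: String): Seq[String] =
  str.split(",").map(_.trim).filter(_.nonEmpty).toSeq

def isBlobStorageScheme(supported: String, scheme: String): Boolean =
  stringToSeq(supported.toLowerCase(Locale.ROOT))
    .contains(scheme.toLowerCase(Locale.ROOT))

// Scheme matching is case-insensitive ...
assert(isBlobStorageScheme("s3,s3a,s3n", "S3A"))
// ... non-blob schemes are rejected ...
assert(!isBlobStorageScheme("s3,s3a,s3n", "hdfs"))
// ... and whitespace around list entries is tolerated.
assert(isBlobStorageScheme(" s3 , s3a ", "s3"))
```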





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-27 Thread GitBox
URL: https://github.com/apache/spark/pull/27690#discussion_r385496854
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -124,11 +147,26 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     val hiveVersion = externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client.version
     val stagingDir = hadoopConf.get("hive.exec.stagingdir", ".hive-staging")
     val scratchDir = hadoopConf.get("hive.exec.scratchdir", "/tmp/hive")
+    logDebug(s"path '${path.toString}' is used")
+    logDebug(s"staging dir '$stagingDir' is used")
+    logDebug(s"scratch dir '$scratchDir' is used")
 
 Review comment:
   Yes, I packed these four (3+1) `logDebug` calls into one.
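   As a rough sketch of what a single consolidated call can look like (the `logDebug` stub and the exact message format here are illustrative, not the actual patch):

```scala
object LogDebugSketch {
  // Stand-in for Spark's Logging#logDebug; real code mixes in the Logging trait.
  private def logDebug(msg: => String): Unit = println(s"DEBUG: $msg")

  // One message carrying all three values instead of three separate logDebug calls.
  def tmpPathDebugMsg(path: String, stagingDir: String, scratchDir: String): String =
    s"path '$path', staging dir '$stagingDir', scratch dir '$scratchDir' are used"

  def main(args: Array[String]): Unit =
    logDebug(tmpPathDebugMsg("s3a://bucket/table", ".hive-staging", "/tmp/hive"))
}
```

   Since `logDebug` takes a by-name argument, the interpolation cost is only paid when debug logging is enabled.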





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-27 Thread GitBox
URL: https://github.com/apache/spark/pull/27690#discussion_r385160626
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
 
 Review comment:
   I believe Spark's current default Hive version is 2.3, so I used the same Hive version here for simplicity.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-27 Thread GitBox
URL: https://github.com/apache/spark/pull/27690#discussion_r385160313
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
 
 Review comment:
   I believe Spark's current default Hive version is 2.3, so I used the same Hive version here for simplicity.





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-26 Thread GitBox
URL: https://github.com/apache/spark/pull/27690#discussion_r384856247
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/BlobStorageUtils.scala
 ##
 @@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+object BlobStorageUtils {
+  def isBlobStoragePath(
+      hadoopConf: Configuration,
+      path: Path): Boolean = {
+    path != null && isBlobStorageScheme(hadoopConf, Option(path.toUri.getScheme).getOrElse(""))
+  }
+
+  def isBlobStorageScheme(
+      hadoopConf: Configuration,
+      scheme: String): Boolean = {
+    val supportedBlobSchemes = hadoopConf.get("hive.blobstore.supported.schemes", "s3,s3a,s3n")
+    supportedBlobSchemes.toLowerCase(Locale.ROOT)
+      .split(",")
+      .map(_.trim)
+      .toList
 
 Review comment:
   Thanks for the comment. Right, we do not need `toList`.
   I removed it and confirmed that the unit tests succeed.
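   A small sketch of why the conversion is redundant: `String.split` returns an `Array`, which already supports `contains` through the `ArrayOps` enrichment, so no intermediate `List` is needed.

```scala
object NoToListSketch {
  // Same split/trim shape as the snippet above; contains works directly on the Array.
  def schemes(conf: String): Array[String] = conf.split(",").map(_.trim)
}
```

   For example, `NoToListSketch.schemes("s3, s3a ,s3n").contains("s3a")` evaluates to `true` without any `toList` call.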





[GitHub] [spark] moomindani commented on a change in pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage

2020-02-26 Thread GitBox
URL: https://github.com/apache/spark/pull/27690#discussion_r384853033
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
 ##
 @@ -97,7 +97,30 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
   options = Map.empty)
   }
 
-  protected def getExternalTmpPath(
+  /*
+   * Mostly copied from Context.java#getMRTmpPath of Hive 2.3
 
 Review comment:
   Yes, it comes from here: https://github.com/apache/hive/blob/rel/release-2.3.0/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L588

