[GitHub] [spark] cloud-fan commented on a change in pull request #28610: [SPARK-31793][SQL] Reduce the memory usage in file scan location metadata

2020-05-22 Thread GitBox


cloud-fan commented on a change in pull request #28610:
URL: https://github.com/apache/spark/pull/28610#discussion_r429102294



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -116,6 +116,30 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
       assert(isIncluded(df.queryExecution, "Location"))
     }
   }
+
+  test("FileSourceScanExec metadata should contain limited file paths") {
+    withTempPath { path =>
+      val dir = path.getCanonicalPath
+      val partitionCol = "partitionCol"
+      spark.range(10)
+        .select("id", "id")
+        .toDF("value", partitionCol)
+        .write
+        .partitionBy(partitionCol)
+        .orc(dir)
+      val paths = (0 to 9).map(i => s"$dir/$partitionCol=$i")

Review comment:
   does this work for Windows? Should `new File(path, s"$partitionCol=$i").getCanonicalPath` be more robust?
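   For illustration, a sketch of what that Windows-friendly construction could look like inside the test (`path` is the `java.io.File` supplied by `withTempPath`); this spells out the reviewer's suggestion and is not the committed test code:

   ```scala
   import java.io.File

   // Let java.io.File resolve the platform-specific separator instead of
   // hard-coding "/" in an interpolated string.
   val paths = (0 to 9).map { i =>
     new File(path, s"$partitionCol=$i").getCanonicalPath
   }
   ```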








[GitHub] [spark] cloud-fan commented on a change in pull request #28610: [SPARK-31793][SQL] Reduce the memory usage in file scan location metadata

2020-05-22 Thread GitBox


cloud-fan commented on a change in pull request #28610:
URL: https://github.com/apache/spark/pull/28610#discussion_r429091272



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -116,6 +116,39 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
       assert(isIncluded(df.queryExecution, "Location"))
     }
   }
+
+  test("FileSourceScanExec metadata should contain limited file paths") {
+    withTempPath { path =>
+      val dir = path.getCanonicalPath
+      val partitionCol = "partitionCol"
+      spark.range(10)
+        .select("id", "id")
+        .toDF("value", partitionCol)
+        .write
+        .partitionBy(partitionCol)
+        .orc(dir)
+      val paths = (0 to 9).map(i => dir + "/" + partitionCol + "=" + i)
+      val plan = spark
+        .read
+        .orc(paths: _*)
+        .queryExecution
+        .executedPlan
+      val location = plan collectFirst {
+        case f: FileSourceScanExec => f.metadata("Location")
+      }
+      assert(location.isDefined)
+      var found = false

Review comment:
   how about `assert(location.drop(1).dropRight(1).split(",").length < paths.length)`?
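   Applied to the test above, the suggested one-line check might look like the following sketch (it first unwraps the `Option`, which the inline suggestion glosses over); the metadata value is assumed to render as `[path1, path2, ...]`:

   ```scala
   // location is an Option[String]; strip the surrounding brackets, then count
   // the comma-separated entries. Fewer entries than input paths means the
   // location string was truncated as intended.
   val locationStr = location.get
   assert(locationStr.drop(1).dropRight(1).split(",").length < paths.length)
   ```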








[GitHub] [spark] cloud-fan commented on a change in pull request #28610: [SPARK-31793][SQL] Reduce the memory usage in file scan location metadata

2020-05-22 Thread GitBox


cloud-fan commented on a change in pull request #28610:
URL: https://github.com/apache/spark/pull/28610#discussion_r429090593



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala
##########
@@ -116,6 +116,39 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
       assert(isIncluded(df.queryExecution, "Location"))
     }
   }
+
+  test("FileSourceScanExec metadata should contain limited file paths") {
+    withTempPath { path =>
+      val dir = path.getCanonicalPath
+      val partitionCol = "partitionCol"
+      spark.range(10)
+        .select("id", "id")
+        .toDF("value", partitionCol)
+        .write
+        .partitionBy(partitionCol)
+        .orc(dir)
+      val paths = (0 to 9).map(i => dir + "/" + partitionCol + "=" + i)

Review comment:
   nit: `new File(path, partitionCol + "=" + i)`, to be Windows friendly.








[GitHub] [spark] cloud-fan commented on a change in pull request #28610: [SPARK-31793][SQL] Reduce the memory usage in file scan location metadata

2020-05-22 Thread GitBox


cloud-fan commented on a change in pull request #28610:
URL: https://github.com/apache/spark/pull/28610#discussion_r429089781



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -2904,6 +2904,23 @@ private[spark] object Utils extends Logging {
     props.forEach((k, v) => resultProps.put(k, v))
     resultProps
   }
+
+  /**
+   * Convert a sequence of [[Path]] to a metadata string. When the length of the metadata
+   * string exceeds `stopAppendingThreshold`, stop appending paths to save memory.
+   */
+  def buildLocationMetadata(paths: Seq[Path], stopAppendingThreshold: Int): String = {
+    var metadata = "["
+    var index: Int = 0
+    while (index < paths.length && metadata.length <= stopAppendingThreshold) {
+      if (index > 0) {
+        metadata += ", "
+      }
+      metadata += paths(index).toString
+      index += 1
+    }
+    metadata + "]"

Review comment:
   shall we add ` ...]` if we stopped midway?
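   As a sketch, the tail of the method could signal truncation like this (the early-exit check and the ` ...]` suffix are assumptions spelled out from the comment, not part of the diff above):

   ```scala
   // After the while loop: if we bailed out before consuming every path,
   // append a marker so readers of the plan know the list is incomplete.
   if (index < paths.length) {
     metadata + " ...]"
   } else {
     metadata + "]"
   }
   ```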








[GitHub] [spark] cloud-fan commented on a change in pull request #28610: [SPARK-31793][SQL] Reduce the memory usage in file scan location metadata

2020-05-22 Thread GitBox


cloud-fan commented on a change in pull request #28610:
URL: https://github.com/apache/spark/pull/28610#discussion_r429089544



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -2904,6 +2904,23 @@ private[spark] object Utils extends Logging {
     props.forEach((k, v) => resultProps.put(k, v))
     resultProps
   }
+
+  /**
+   * Convert a sequence of [[Path]] to a metadata string. When the length of the metadata
+   * string exceeds `stopAppendingThreshold`, stop appending paths to save memory.
+   */
+  def buildLocationMetadata(paths: Seq[Path], stopAppendingThreshold: Int): String = {
+    var metadata = "["

Review comment:
   can we use `StringBuilder`?
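   A minimal sketch of the loop rewritten around `StringBuilder`, which avoids reallocating the whole string on every `+=`; this illustrates the suggestion, not the code that was eventually merged:

   ```scala
   import org.apache.hadoop.fs.Path

   def buildLocationMetadata(paths: Seq[Path], stopAppendingThreshold: Int): String = {
     val metadata = new StringBuilder("[")
     var index = 0
     // Stop appending once the accumulated length crosses the threshold, so a
     // huge path list cannot blow up driver memory.
     while (index < paths.length && metadata.length <= stopAppendingThreshold) {
       if (index > 0) {
         metadata.append(", ")
       }
       metadata.append(paths(index).toString)
       index += 1
     }
     metadata.append("]")
     metadata.toString
   }
   ```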




