[GitHub] [spark] cloud-fan commented on a change in pull request #28610: [SPARK-31793][SQL] Reduce the memory usage in file scan location metadata
cloud-fan commented on a change in pull request #28610: URL: https://github.com/apache/spark/pull/28610#discussion_r429102294

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala

## @@ -116,6 +116,30 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
       assert(isIncluded(df.queryExecution, "Location"))
     }
   }
 +
 +  test("FileSourceScanExec metadata should contain limited file paths") {
 +    withTempPath { path =>
 +      val dir = path.getCanonicalPath
 +      val partitionCol = "partitionCol"
 +      spark.range(10)
 +        .select("id", "id")
 +        .toDF("value", partitionCol)
 +        .write
 +        .partitionBy(partitionCol)
 +        .orc(dir)
 +      val paths = (0 to 9).map(i => s"$dir/$partitionCol=$i")

Review comment: does this work on Windows? Would `new File(path, s"$partitionCol=$i").getCanonicalPath` be more robust?

----

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
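A minimal sketch of the reviewer's suggestion (not the PR's final code): building each partition directory with `java.io.File` instead of concatenating with a hard-coded `"/"` lets the JVM pick the platform separator, so the same test logic works on Windows. The `PartitionPaths` object and its `build` method are hypothetical names for illustration.

```scala
import java.io.File

// Hypothetical helper mirroring the review suggestion: let java.io.File
// join the parent directory and the "partitionCol=i" segment, so the
// separator is correct on both Unix ("/") and Windows ("\").
object PartitionPaths {
  def build(base: File, partitionCol: String, n: Int): Seq[String] =
    (0 until n).map(i => new File(base, s"$partitionCol=$i").getCanonicalPath)
}
```

`getCanonicalPath` additionally resolves relative segments, which is why the comment prefers it over raw string concatenation.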
cloud-fan commented on a change in pull request #28610: URL: https://github.com/apache/spark/pull/28610#discussion_r429091272

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala

## @@ -116,6 +116,39 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
       assert(isIncluded(df.queryExecution, "Location"))
     }
   }
 +
 +  test("FileSourceScanExec metadata should contain limited file paths") {
 +    withTempPath { path =>
 +      val dir = path.getCanonicalPath
 +      val partitionCol = "partitionCol"
 +      spark.range(10)
 +        .select("id", "id")
 +        .toDF("value", partitionCol)
 +        .write
 +        .partitionBy(partitionCol)
 +        .orc(dir)
 +      val paths = (0 to 9).map(i => dir + "/" + partitionCol + "=" + i)
 +      val plan = spark
 +        .read
 +        .orc(paths: _*)
 +        .queryExecution
 +        .executedPlan
 +      val location = plan collectFirst {
 +        case f: FileSourceScanExec => f.metadata("Location")
 +      }
 +      assert(location.isDefined)
 +      var found = false

Review comment: how about `assert(location.drop(1).dropRight(1).split(",").length < paths.length)`?
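A standalone sketch of the suggested assertion, using a hypothetical `Location` string of the shape `[p0, p1, ...]`; the real test reads it from the executed plan (and, since `location` there is an `Option`, would call `.get` first). The `LocationCheck` object is an illustrative name, not Spark API.

```scala
// Sketch of the check proposed in the review: strip the enclosing "[" and "]"
// from the Location metadata string and count its comma-separated entries.
object LocationCheck {
  def listedCount(location: String): Int =
    location.drop(1).dropRight(1).split(",").length
}
```

With ten input partition paths but a truncated Location such as `"[/d/partitionCol=0, /d/partitionCol=1]"`, `listedCount` returns 2, so `assert(LocationCheck.listedCount(location) < paths.length)` passes and replaces the manual `found` flag.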
cloud-fan commented on a change in pull request #28610: URL: https://github.com/apache/spark/pull/28610#discussion_r429090593

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala

## @@ -116,6 +116,39 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
       assert(isIncluded(df.queryExecution, "Location"))
     }
   }
 +
 +  test("FileSourceScanExec metadata should contain limited file paths") {
 +    withTempPath { path =>
 +      val dir = path.getCanonicalPath
 +      val partitionCol = "partitionCol"
 +      spark.range(10)
 +        .select("id", "id")
 +        .toDF("value", partitionCol)
 +        .write
 +        .partitionBy(partitionCol)
 +        .orc(dir)
 +      val paths = (0 to 9).map(i => dir + "/" + partitionCol + "=" + i)

Review comment: nit: `new File(path, partitionCol + "=" + i)`, to be Windows-friendly.
cloud-fan commented on a change in pull request #28610: URL: https://github.com/apache/spark/pull/28610#discussion_r429089781

## File path: core/src/main/scala/org/apache/spark/util/Utils.scala

## @@ -2904,6 +2904,23 @@ private[spark] object Utils extends Logging {
     props.forEach((k, v) => resultProps.put(k, v))
     resultProps
   }
 +
 +  /**
 +   * Convert a sequence of [[Path]] to a metadata string. When the length of metadata string
 +   * exceeds `stopAppendingThreshold`, stop appending paths for saving memory.
 +   */
 +  def buildLocationMetadata(paths: Seq[Path], stopAppendingThreshold: Int): String = {
 +    var metadata = "["
 +    var index: Int = 0
 +    while (index < paths.length && metadata.length <= stopAppendingThreshold) {
 +      if (index > 0) {
 +        metadata += ", "
 +      }
 +      metadata += paths(index).toString
 +      index += 1
 +    }
 +    metadata + "]"

Review comment: shall we add ` ...]` if we stopped midway?
cloud-fan commented on a change in pull request #28610: URL: https://github.com/apache/spark/pull/28610#discussion_r429089544

## File path: core/src/main/scala/org/apache/spark/util/Utils.scala

## @@ -2904,6 +2904,23 @@ private[spark] object Utils extends Logging {
     props.forEach((k, v) => resultProps.put(k, v))
     resultProps
   }
 +
 +  /**
 +   * Convert a sequence of [[Path]] to a metadata string. When the length of metadata string
 +   * exceeds `stopAppendingThreshold`, stop appending paths for saving memory.
 +   */
 +  def buildLocationMetadata(paths: Seq[Path], stopAppendingThreshold: Int): String = {
 +    var metadata = "["

Review comment: can we use `StringBuilder`?
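A sketch of the reviewed utility with both review suggestions applied: a `StringBuilder` instead of repeated `String` concatenation (which copies the whole string on every `+=`), and a `" ..."` marker before the closing bracket when appending stopped midway. Plain `String`s stand in for `org.apache.hadoop.fs.Path` here to keep the example self-contained; this is an illustration of the suggestions, not the PR's merged code.

```scala
object LocationMetadata {
  def build(paths: Seq[String], stopAppendingThreshold: Int): String = {
    // StringBuilder avoids O(n^2) behavior from repeated String concatenation.
    val sb = new StringBuilder("[")
    var index = 0
    while (index < paths.length && sb.length <= stopAppendingThreshold) {
      if (index > 0) sb.append(", ")
      sb.append(paths(index))
      index += 1
    }
    // Stopped midway: signal that some paths were omitted.
    if (index < paths.length) sb.append(" ...")
    sb.append("]").toString
  }
}
```

For example, `LocationMetadata.build(Seq("a", "b", "c"), 2)` yields `"[a, b ...]"`, while a threshold large enough for all paths yields the full `"[a, b, c]"`.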