[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722561#comment-16722561 ]

ASF GitHub Bot commented on SPARK-26222:

xuanyuanking opened a new pull request #23327: [SPARK-26222][SQL] Track file listing time
URL: https://github.com/apache/spark/pull/23327

## What changes were proposed in this pull request?

Track file listing time and add it to the scan node's metrics, together with start and end timestamp metrics.

![image](https://user-images.githubusercontent.com/4833765/50058268-cacfd200-01b0-11e9-95dc-b2093fc54178.png)

## How was this patch tested?

New tests added in SQLMetricsSuite.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Scan: track file listing time
> -----------------------------
>
> Key: SPARK-26222
> URL: https://issues.apache.org/jira/browse/SPARK-26222
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Reynold Xin
> Priority: Major
>
> We should track file listing time and add it to the scan node's SQL metric,
> so we have visibility into how much time is spent in file listing. It'd be useful to
> track not just the duration, but also the start and end time, so we can construct a
> timeline.
> This requires a little bit of design to define what file listing time means
> when we are reading from the cache vs. not from the cache.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
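The PR attaches not just a listing duration but also start and end timestamps to the scan node's metrics. A minimal sketch of that idea in Python (the class and metric names here are illustrative, not Spark's actual API):

```python
import time

class ScanMetrics:
    """Illustrative container mirroring the idea of per-scan SQL metrics:
    record the file-listing duration together with its start/end
    timestamps so a timeline can later be reconstructed."""

    def __init__(self):
        self.metrics = {}

    def time_file_listing(self, list_fn):
        # Record wall-clock start/end around the listing call.
        start = time.time()
        result = list_fn()
        end = time.time()
        self.metrics["fileListingStart"] = start
        self.metrics["fileListingEnd"] = end
        self.metrics["fileListingDuration"] = end - start
        return result

metrics = ScanMetrics()
files = metrics.time_file_listing(lambda: ["part-0.parquet", "part-1.parquet"])
```

Reporting start/end rather than only a duration is what lets a UI lay the listing work out on a timeline across operators.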
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722533#comment-16722533 ]

ASF GitHub Bot commented on SPARK-26222:

xuanyuanking closed pull request #23298: [SPARK-26222][SQL] Track file listing time
URL: https://github.com/apache/spark/pull/23298

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

```diff
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
index cd75407c7ee7a..9d391741f596f 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
@@ -28,7 +28,7 @@ import org.apache.spark.util.BoundedPriorityQueue
  * There are two separate concepts we track:
  *
  * 1. Phases: These are broad scope phases in query planning, as listed below, i.e. analysis,
- *    optimizationm and physical planning (just planning).
+ *    optimization and physical planning (just planning).
  *
  * 2. Rules: These are the individual Catalyst rules that we track. In addition to time, we also
  *    track the number of invocations and effective invocations.
@@ -41,6 +41,10 @@ object QueryPlanningTracker {
   val OPTIMIZATION = "optimization"
   val PLANNING = "planning"

+  // File listing relative phases.
+  val FILE_LISTING = "fileListing"
+  val PARTITION_PRUNING = "partitionPruning"
+
   /**
    * Summary for a rule.
    * @param totalTimeNs total amount of time, in nanosecs, spent in this rule.
@@ -79,7 +83,11 @@ object QueryPlanningTracker {
   }

   /** Returns the current tracker in scope, based on the thread local variable. */
-  def get: Option[QueryPlanningTracker] = Option(localTracker.get())
+  def get: QueryPlanningTracker = Option(localTracker.get()).getOrElse(NoopTracker)
+
+  /** Returns the current tracker in scope or create a new one. */
+  def getOrCreate: QueryPlanningTracker =
+    Option(localTracker.get()).getOrElse(new QueryPlanningTracker)

   /** Sets the current tracker for the execution of function f. We assume f is single-threaded. */
   def withTracker[T](tracker: QueryPlanningTracker)(f: => T): T = {
@@ -161,3 +169,9 @@ class QueryPlanningTracker {
   }
 }
+
+/** A no-op tracker used for current tracker in scope not set. */
+object NoopTracker extends QueryPlanningTracker {
+  override def measurePhase[T](phase: String)(f: => T): T = f
+  override def recordRuleInvocation(rule: String, timeNs: Long, effective: Boolean): Unit = {}
+}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
index cf6ff4f986399..f1a0b1ae2b6e7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
@@ -86,7 +86,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
     var curPlan = plan
     val queryExecutionMetrics = RuleExecutor.queryExecutionMeter
     val planChangeLogger = new PlanChangeLogger()
-    val tracker: Option[QueryPlanningTracker] = QueryPlanningTracker.get
+    val tracker: QueryPlanningTracker = QueryPlanningTracker.get

     batches.foreach { batch =>
       val batchStartPlan = curPlan
@@ -112,7 +112,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
           queryExecutionMetrics.incNumExecution(rule.ruleName)

           // Record timing information using QueryPlanningTracker
-          tracker.foreach(_.recordRuleInvocation(rule.ruleName, runTime, effective))
+          tracker.recordRuleInvocation(rule.ruleName, runTime, effective)

           // Run the structural integrity checker against the plan after each rule.
           if (!isPlanIntegral(result)) {
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 9751528654ffb..917c9cc371fc0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -28,6 +28,7 @@ import org.apache.spark.annotation.Stable
 import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.QueryPlanningTracker
 import org.apache.spark.sql.catalyst.csv.{CSVHeaderChecker, CSVOptions, UnivocityParser}
 import
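The diff above replaces `Option[QueryPlanningTracker]` with a `NoopTracker` fallback so callers can invoke the tracker unconditionally. The pattern (a thread-local tracker with a null-object default) can be sketched in Python like this; the names mirror the Scala code but the implementation is purely illustrative:

```python
import threading

class Tracker:
    """Illustrative tracker in the spirit of QueryPlanningTracker:
    accumulates per-rule timing in nanoseconds."""
    def __init__(self):
        self.rule_times = {}

    def record_rule_invocation(self, rule, time_ns, effective):
        self.rule_times[rule] = self.rule_times.get(rule, 0) + time_ns

class NoopTracker(Tracker):
    # Null object: callers may call record_* unconditionally,
    # with no Option/None checks at every call site.
    def record_rule_invocation(self, rule, time_ns, effective):
        pass

_local = threading.local()
NOOP = NoopTracker()

def get_tracker():
    # Mirrors the diff's `get`: fall back to a shared no-op tracker.
    return getattr(_local, "tracker", None) or NOOP

def with_tracker(tracker, f):
    # Mirrors `withTracker`: install the tracker only for the duration of f.
    old = getattr(_local, "tracker", None)
    _local.tracker = tracker
    try:
        return f()
    finally:
        _local.tracker = old

t = Tracker()
with_tracker(t, lambda: get_tracker().record_rule_invocation(
    "PruneFileSourcePartitions", 1200, True))
# Outside the scope, recording goes to the no-op tracker and is dropped.
get_tracker().record_rule_invocation("ignored", 99, False)
```

The design choice is the same as in the diff: instrumentation code stays one-liner-simple, and the cost of an unset tracker is a virtual call to an empty method.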
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720974#comment-16720974 ]

Yuanjian Li commented on SPARK-26222:

Copy that, thanks for the explanation!
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720957#comment-16720957 ]

Reynold Xin commented on SPARK-26222:

So I spent some time looking at the code base to understand what's going on and how we should report this. In short, I think we have two types of tables: (1) tables that require a full file listing in order to resolve the schema (including partition columns), and (2) tables that don't. This means there are 3 scenarios to think about:

(1) spark.read.parquet("/path/to/table").count() -> in this case an InMemoryFileIndex containing all of the leaf files is created.

(2a) spark.read.table("abcd").count() -> when partitions are not tracked in the catalog, which is basically the same as (1).

(2b) spark.read.table("abcd").count() -> when partitions are tracked in the catalog. In this case a CatalogFileIndex is created. We should measure the listing time in CatalogFileIndex.filterPartitions.

Also, instead of tracking them as phases, I'd associate the timing with the scan operator in SQL metrics. I'd report the start and end time, rather than just a single duration.
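The point of reporting start and end time per scan, rather than a single duration, is that (start, end) pairs from several operators can be assembled into a timeline. A small hypothetical sketch of that assembly step (the event names are made up for illustration):

```python
def build_timeline(events):
    """Given (name, start, end) tuples collected from scan nodes, return
    them ordered by start time with each event's duration, so overlapping
    file-listing work across operators can be visualized."""
    timeline = []
    for name, start, end in sorted(events, key=lambda e: e[1]):
        timeline.append({"name": name, "start": start, "duration": end - start})
    return timeline

# Two scans whose listing windows overlap; a bare duration metric
# could not reveal this overlap, but (start, end) pairs can.
events = [
    ("scan parquet /path/to/table", 100.0, 130.0),
    ("scan table abcd", 90.0, 105.0),
]
timeline = build_timeline(events)
```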
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720088#comment-16720088 ]

Yuanjian Li commented on SPARK-26222:

Hi Reynold, the first PR tracks all file listing time in an independent phase, which happens in analysis, optimization, and DataFrameReader/Writer. Does this match your thoughts?
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719122#comment-16719122 ]

ASF GitHub Bot commented on SPARK-26222:

xuanyuanking opened a new pull request #23298: [SPARK-26222][SQL] Track file listing time
URL: https://github.com/apache/spark/pull/23298

## What changes were proposed in this pull request?

File listing time in the scan node's SQL metrics was done and improved in spark-20136/SPARK-26327. In this PR we use QueryPlanningTracker to track the start and end time of file listing.

## How was this patch tested?

Tests added for DataFrameWriter and the non-physical phases below:
- DataFrameReader.load, where file listing is triggered by DataSource.resolveRelation.
- Analysis rules like FindDataSourceTable.
- Optimization rules like PruneFileSourcePartitions and OptimizeMetadataOnlyQuery.
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719110#comment-16719110 ]

Yuanjian Li commented on SPARK-26222:

Yes, I think I misunderstood your original intention. I'll submit a PR dealing with the non-physical phases first. Currently we do file listing in the following scenarios:

Physical phase:
* Operations that create/refresh/insert/drop a table, where file listing is triggered by InMemoryFileIndex.refresh.
* DataFrameWriter.runCommand, where file listing is triggered by DataSource.resolveRelation. (Also included in the first PR.)
* DataSourceScanExec execution.

Non-physical phase:
* DataFrameReader.load, where file listing is triggered by DataSource.resolveRelation.
* Analysis rules like FindDataSourceTable and RelationConversions in the Hive analyzer.
* Optimization rules like PruneFileSourcePartitions and OptimizeMetadataOnlyQuery.

All the heavy file listing work against the FileSystem (without the cache) is done in InMemoryFileIndex.refresh0; we can track this base function and thereby exclude cache hits.
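The proposal above is to instrument only the uncached work done in InMemoryFileIndex.refresh0, so that cache hits contribute nothing to listing time. A hypothetical Python analogue of that structure (class and method names chosen to echo the Scala code, not to reproduce it):

```python
import time

class CachedFileIndex:
    """Illustrative analogue of InMemoryFileIndex: leaf files are cached,
    and only genuine listing work (the refresh0 analogue) contributes
    to the tracked listing time."""

    def __init__(self, list_leaf_files):
        self._list_leaf_files = list_leaf_files  # the expensive FileSystem call
        self._cache = None
        self.listing_time = 0.0
        self.refresh_count = 0

    def list_files(self):
        if self._cache is None:
            self._refresh0()
        return self._cache  # cache hit: nothing is timed

    def refresh(self):
        # Invalidate and re-list; the re-listing is timed too,
        # matching "each time after the cache is refreshed".
        self._cache = None
        self._refresh0()

    def _refresh0(self):
        # The single choke point where real listing happens and is timed.
        start = time.time()
        self._cache = self._list_leaf_files()
        self.listing_time += time.time() - start
        self.refresh_count += 1

idx = CachedFileIndex(lambda: ["part-0.parquet"])
idx.list_files()   # cache miss: timed
idx.list_files()   # cache hit: not timed
idx.refresh()      # explicit refresh: timed again
```

Funneling every real listing through one base function is what makes the "exclude cache hits" definition easy to enforce.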
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715663#comment-16715663 ]

Reynold Xin commented on SPARK-26222:

Do we do any file listing in a non-physical phase? E.g. if somebody does spark.read.parquet("..."), will we do any file listing there?
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710303#comment-16710303 ]

Yuanjian Li commented on SPARK-26222:

Leaving some thoughts for further discussion:
* There's one place that tracks file listing duration now, in `FileSourceScanExec`; the metric name is `metadataTime` (maybe an inaccurate name; it should be changed to file listing time). We should add the phase tracking here.
* We should also add duration and phase tracking in these 2 places:
** Schema inference in HiveMetastoreCatalog.
** replaceTableScanWithPartitionMetadata in the OptimizeMetadataOnlyQuery rule.
* IIUC, the phase tracking can use `QueryPlanningTracker` directly, because it's thread-local and passed through within every `RuleExecutor` run.
* About the meaning of listing time: maybe we can define it to refer only to listing without the cache, because loading from the cache is not the 'heavy' operation we want to track, and it also takes less time. Listing time then covers not only the first `listFiles` call, but also every call after the cache is refreshed.
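The phase tracking discussed above could be sketched as follows. The phase names follow the thread (`fileListing`, `partitionPruning`); the `PhaseTracker` class and its `measure_phase` method are illustrative Python, not the actual `QueryPlanningTracker` API:

```python
import time
from collections import defaultdict

# Phase names as proposed in the discussion.
FILE_LISTING = "fileListing"
PARTITION_PRUNING = "partitionPruning"

class PhaseTracker:
    """Sketch of per-phase timing in the spirit of QueryPlanningTracker:
    each call to measure_phase adds the elapsed time to that phase's
    running total, so repeated listings within one query accumulate."""

    def __init__(self):
        self.phases = defaultdict(float)

    def measure_phase(self, phase, f):
        start = time.time()
        try:
            return f()
        finally:
            self.phases[phase] += time.time() - start

tracker = PhaseTracker()
files = tracker.measure_phase(
    FILE_LISTING, lambda: ["part-0.parquet", "part-1.parquet"])
```

In Spark the tracker would additionally be installed thread-locally so that analyzer and optimizer rules can reach it without plumbing it through every call site.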