[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-16 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722561#comment-16722561 ]

ASF GitHub Bot commented on SPARK-26222:


xuanyuanking opened a new pull request #23327: [SPARK-26222][SQL] Track file 
listing time
URL: https://github.com/apache/spark/pull/23327
 
 
   ## What changes were proposed in this pull request?
   
   Track file listing time and add it to the scan node's metrics, along with start and end timestamp metrics.
   
![image](https://user-images.githubusercontent.com/4833765/50058268-cacfd200-01b0-11e9-95dc-b2093fc54178.png)
   
   
   ## How was this patch tested?
   
   Add new tests in SQLMetricsSuite.
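
   For reference, a minimal sketch (not necessarily this PR's exact wiring) of how such metrics could be declared on a scan node via SQLMetrics; the metric keys and display names below are illustrative assumptions:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Hypothetical metric keys; the PR's actual names may differ.
def fileListingMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
  // Duration of the listing, rendered as a timing metric in the SQL UI.
  "fileListingTime" -> SQLMetrics.createTimingMetric(sc, "file listing time"),
  // Raw wall-clock timestamps so a timeline can be reconstructed later.
  "fileListingStart" -> SQLMetrics.createMetric(sc, "file listing start timestamp"),
  "fileListingEnd" -> SQLMetrics.createMetric(sc, "file listing end timestamp")
)
```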
   




> Scan: track file listing time
> -
>
> Key: SPARK-26222
> URL: https://issues.apache.org/jira/browse/SPARK-26222
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>
> We should track file listing time and add it to the scan node's SQL metric, 
> so we have visibility how much is spent in file listing. It'd be useful to 
> track not just duration, but also start and end time so we can construct a 
> timeline.
> This requires a little bit of design to define what file listing time means
> when we are reading from cache vs. not from cache.






[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-16 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722533#comment-16722533 ]

ASF GitHub Bot commented on SPARK-26222:


xuanyuanking closed pull request #23298: [SPARK-26222][SQL] Track file listing 
time
URL: https://github.com/apache/spark/pull/23298
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
index cd75407c7ee7a..9d391741f596f 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
@@ -28,7 +28,7 @@ import org.apache.spark.util.BoundedPriorityQueue
  * There are two separate concepts we track:
  *
  * 1. Phases: These are broad scope phases in query planning, as listed below, i.e. analysis,
- * optimizationm and physical planning (just planning).
+ * optimization and physical planning (just planning).
  *
  * 2. Rules: These are the individual Catalyst rules that we track. In addition to time, we also
  * track the number of invocations and effective invocations.
@@ -41,6 +41,10 @@ object QueryPlanningTracker {
   val OPTIMIZATION = "optimization"
   val PLANNING = "planning"
 
+  // File listing related phases.
+  val FILE_LISTING = "fileListing"
+  val PARTITION_PRUNING = "partitionPruning"
+
   /**
    * Summary for a rule.
    * @param totalTimeNs total amount of time, in nanosecs, spent in this rule.
@@ -79,7 +83,11 @@ object QueryPlanningTracker {
   }
 
   /** Returns the current tracker in scope, based on the thread local variable. */
-  def get: Option[QueryPlanningTracker] = Option(localTracker.get())
+  def get: QueryPlanningTracker = Option(localTracker.get()).getOrElse(NoopTracker)
+
+  /** Returns the current tracker in scope or creates a new one. */
+  def getOrCreate: QueryPlanningTracker =
+    Option(localTracker.get()).getOrElse(new QueryPlanningTracker)
 
   /** Sets the current tracker for the execution of function f. We assume f is single-threaded. */
   def withTracker[T](tracker: QueryPlanningTracker)(f: => T): T = {
@@ -161,3 +169,9 @@ class QueryPlanningTracker {
   }
 
 }
+
+/** A no-op tracker used when the current tracker in scope is not set. */
+object NoopTracker extends QueryPlanningTracker {
+  override def measurePhase[T](phase: String)(f: => T): T = f
+  override def recordRuleInvocation(rule: String, timeNs: Long, effective: Boolean): Unit = {}
+}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
index cf6ff4f986399..f1a0b1ae2b6e7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
@@ -86,7 +86,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
     var curPlan = plan
     val queryExecutionMetrics = RuleExecutor.queryExecutionMeter
     val planChangeLogger = new PlanChangeLogger()
-    val tracker: Option[QueryPlanningTracker] = QueryPlanningTracker.get
+    val tracker: QueryPlanningTracker = QueryPlanningTracker.get
 
     batches.foreach { batch =>
       val batchStartPlan = curPlan
@@ -112,7 +112,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
             queryExecutionMetrics.incNumExecution(rule.ruleName)
 
             // Record timing information using QueryPlanningTracker
-            tracker.foreach(_.recordRuleInvocation(rule.ruleName, runTime, effective))
+            tracker.recordRuleInvocation(rule.ruleName, runTime, effective)
 
             // Run the structural integrity checker against the plan after each rule.
             if (!isPlanIntegral(result)) {
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 9751528654ffb..917c9cc371fc0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -28,6 +28,7 @@ import org.apache.spark.annotation.Stable
 import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.QueryPlanningTracker
 import org.apache.spark.sql.catalyst.csv.{CSVHeaderChecker, CSVOptions, UnivocityParser}
 import
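
A hedged usage sketch of the tracker API introduced in the diff above: the FILE_LISTING constant, withTracker, measurePhase, and NoopTracker come from the diff itself, while listLeafFiles is a hypothetical stand-in for the real listing work.

```scala
import org.apache.spark.sql.catalyst.QueryPlanningTracker

// Hypothetical stand-in for the actual FileSystem listing.
def listLeafFiles(): Seq[String] = Seq("/data/part-00000.parquet")

val tracker = new QueryPlanningTracker
val files = QueryPlanningTracker.withTracker(tracker) {
  // Inside this scope QueryPlanningTracker.get returns `tracker`; outside it
  // returns NoopTracker, so callers never have to unwrap an Option.
  QueryPlanningTracker.get.measurePhase(QueryPlanningTracker.FILE_LISTING) {
    listLeafFiles()
  }
}
```

The NoopTracker default is what lets RuleExecutor drop the Option handling in the second hunk.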

[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-13 Thread Yuanjian Li (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720974#comment-16720974 ]

Yuanjian Li commented on SPARK-26222:
-

Copy that, thanks for your explanation!




[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-13 Thread Reynold Xin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720957#comment-16720957 ]

Reynold Xin commented on SPARK-26222:
-

So I spent some time looking at the code base to understand what's going on, 
and how we should report this.

In short, I think we have two types of tables:

(1) tables that require full file listing in order to resolve the schema 
(including partition columns)
(2) tables that don't.

This means there are 3 scenarios to think about:

(1) spark.read.parquet("/path/to/table").count() -> in this case an 
InMemoryFileIndex containing all of the leaf files is created.

(2a) spark.read.table("abcd").count() -> when partitions are not tracked in the 
catalog, which is basically the same as (1)

(2b) spark.read.table("abcd").count() -> when partitions are tracked in the 
catalog. In this case a CatalogFileIndex is created. We should measure the 
listing time in CatalogFileIndex.filterPartitions.


Also instead of tracking them as phases, I'd associate the timing with the scan 
operator in SQL metrics. I'd report the start and end time, rather than just a 
single duration.
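
A minimal sketch of that suggestion, assuming SQLMetric handles for the start, end, and duration values already exist on the scan node (the helper name and metric wiring are illustrative, not the eventual implementation):

```scala
import org.apache.spark.sql.execution.metric.SQLMetric

// Record wall-clock start/end plus duration around a listing call,
// so the UI can show both a timeline and a total.
def timeListing[T](start: SQLMetric, end: SQLMetric, duration: SQLMetric)
    (body: => T): T = {
  val startMs = System.currentTimeMillis()
  try body finally {
    val endMs = System.currentTimeMillis()
    start.set(startMs)       // timeline start
    end.set(endMs)           // timeline end
    duration.add(endMs - startMs)
  }
}
```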

 




[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-13 Thread Yuanjian Li (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720088#comment-16720088 ]

Yuanjian Li commented on SPARK-26222:
-

Hi Reynold, the first PR tracks all file listing time as an independent phase, which happens during analysis, optimization and DataFrameReader/Writer. Does this match your thoughts?




[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-12 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719122#comment-16719122 ]

ASF GitHub Bot commented on SPARK-26222:


xuanyuanking opened a new pull request #23298: [SPARK-26222][SQL] Track file 
listing time
URL: https://github.com/apache/spark/pull/23298
 
 
   ## What changes were proposed in this pull request?
   
   File listing time in the scan node's SQL metrics was added and improved in SPARK-20136/SPARK-26327. In this PR we use QueryPlanningTracker to track the start and end time of file listing.
   
   ## How was this patch tested?
   
   Add tests for DataFrameWriter and the non-physical phases below:
   - DataFrameReader.load: file listing is triggered by DataSource.resolveRelation.
   - Analysis rules like FindDataSourceTable.
   - Optimization rules like PruneFileSourcePartitions and OptimizeMetadataOnlyQuery.
   







[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-12 Thread Yuanjian Li (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719110#comment-16719110 ]

Yuanjian Li commented on SPARK-26222:
-

Yes, I think I misunderstood your original intention. I'll open a PR dealing with the non-physical phases first.

Currently we do file listing in the following scenarios:

Physical phase:
 * Operations that create/refresh/insert into/drop a table; file listing is triggered by InMemoryFileIndex.refresh.
 * DataFrameWriter.runCommand; file listing is triggered by DataSource.resolveRelation. (Also included in the first PR.)
 * DataSourceScanExec execution.

Non-physical phase:
 * DataFrameReader.load; file listing is triggered by DataSource.resolveRelation.
 * Analysis rules like FindDataSourceTable and RelationConversions in the Hive analyzer.
 * Optimization rules like PruneFileSourcePartitions and OptimizeMetadataOnlyQuery.

All the heavy file listing work against the FileSystem (bypassing the cache) is done in InMemoryFileIndex.refresh0; we can instrument that base function and thereby exclude cache hits, as in the sketch below.
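
A sketch of that idea, under the assumption that only uncached listing should count; scanFileSystem and the cache map are hypothetical stand-ins for InMemoryFileIndex internals:

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.QueryPlanningTracker

val cache = mutable.Map.empty[String, Seq[String]]

// Hypothetical stand-in for the heavy work done in InMemoryFileIndex.refresh0.
def scanFileSystem(path: String): Seq[String] =
  Seq(s"$path/part-00000.parquet")

def listFiles(path: String, tracker: QueryPlanningTracker): Seq[String] =
  cache.getOrElseUpdate(path, {
    // Only a cache miss reaches this block, so only real FileSystem work
    // is attributed to the fileListing phase; cache hits cost nothing here.
    tracker.measurePhase(QueryPlanningTracker.FILE_LISTING) {
      scanFileSystem(path)
    }
  })
```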




[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-10 Thread Reynold Xin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715663#comment-16715663 ]

Reynold Xin commented on SPARK-26222:
-

Do we do any file listing in the non-physical phase? E.g. if somebody does spark.read.parquet("..."), will we do any file listing there?




[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-05 Thread Yuanjian Li (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710303#comment-16710303 ]

Yuanjian Li commented on SPARK-26222:
-

Leaving some thoughts for further discussion:
 * There is one place that already tracks file listing duration, in `FileSourceScanExec`; the metric name is `metadataTime` (maybe an inaccurate name, which should be changed to file listing time). We should add the phase tracking there.
 * We should also add duration and phase tracking in these two places:
 ** Schema inference in HiveMetastoreCatalog.
 ** replaceTableScanWithPartitionMetadata in the OptimizeMetadataOnlyQuery rule.
 * IIUC, the phase tracking can use `QueryPlanningTracker` directly, because it is thread-local and passed through all `RuleExecutor` runs; a generic sketch of that pattern follows below.
 * About the meaning of listing time: maybe we can define it as referring only to reads without the cache, because loading from the cache is not the 'heavy' operation we want to track, and it also takes far less time. The listing time then covers not only the first `listFiles` call, but also every call after the cache is refreshed.
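
A generic sketch of the thread-local pattern referred to above (mirroring, not reproducing, what the QueryPlanningTracker companion object does): the tracker rides in a ThreadLocal, so rules deep inside a RuleExecutor run can record timings without the tracker being threaded through every call.

```scala
import org.apache.spark.sql.catalyst.QueryPlanningTracker

// Illustrative only; QueryPlanningTracker itself implements this pattern.
object TrackerScope {
  private val local = new ThreadLocal[QueryPlanningTracker]

  // Install `t` for the duration of `f`, restoring the previous value after.
  def withTracker[T](t: QueryPlanningTracker)(f: => T): T = {
    val saved = local.get()
    local.set(t)
    try f finally local.set(saved)
  }

  // Anything running on the same thread can look up the current tracker.
  def current: Option[QueryPlanningTracker] = Option(local.get())
}
```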
