[jira] [Issue Comment Deleted] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-12-10 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26221:

Comment: was deleted

(was: User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23192)

> Improve Spark SQL instrumentation and metrics
> ---------------------------------------------
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plans that write data out vs. those that return data to 
> the driver
>  * I.e. ETL & report generation vs. interactive analysis
>  * This is related to the data sink item below. We need to make sure that, 
> from the query plan, we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; need to make sure the 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide whether we can measure serialization and deserialization time
> Client fetch time
>  * Sometimes a query takes a long time to run because it is blocked on the 
> client fetching results (e.g. via a result iterator). Record the time blocked 
> on the client so we can exclude it when measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query, e.g. dump execution id in task logs?
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id in driver logs.
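The client-fetch item above is the easiest to sketch outside Spark itself. The following is a minimal, self-contained Python illustration (the class and attribute names are hypothetical, not Spark APIs): an iterator wrapper that accumulates the time spent between successive next() calls, i.e. the time the producer is blocked waiting on the client, so that time can be subtracted from the wall-clock query duration.

```python
import time

class ClientBlockTrackingIterator:
    """Wraps a result iterator and accumulates the time that elapses
    between handing out one row and the client asking for the next,
    i.e. time spent blocked on the consumer rather than executing."""

    def __init__(self, it):
        self._it = iter(it)
        self._last_return = None       # monotonic time of the previous hand-off
        self.client_blocked_secs = 0.0

    def __iter__(self):
        return self

    def __next__(self):
        now = time.monotonic()
        if self._last_return is not None:
            # Gap since the previous row was handed out = client time.
            self.client_blocked_secs += now - self._last_return
        try:
            row = next(self._it)
        finally:
            self._last_return = time.monotonic()
        return row

rows = ClientBlockTrackingIterator(range(3))
for _ in rows:
    time.sleep(0.01)  # simulate a slow client between fetches
# rows.client_blocked_secs now holds the accumulated client stall time,
# which a metrics layer could report separately from execution time.
```

The same idea, implemented around Spark's result iterator on the driver, would let the UI report "query execution time excluding client fetch" as proposed above.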



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


