[jira] [Issue Comment Deleted] (SPARK-26221) Improve Spark SQL instrumentation and metrics
[ https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-26221:
Comment: was deleted (was: User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/23192)

> Improve Spark SQL instrumentation and metrics
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics and instrumentation. Some thoughts:
>
> Differentiate query plans that write data out from those that return data to the driver:
> * i.e., ETL and report generation vs. interactive analysis
> * This is related to the data sink item below. We need to be able to tell from the query plan what a query is doing.
>
> Data sink: add an operator for the data sink, with metrics that tell us:
> * Write time
> * Number of records written
> * Size of output written
> * Number of partitions modified
> * Metastore update time
> * Also track the number of records for collect / limit
>
> Scan:
> * Track file listing time (start and end, so we can construct a timeline, not just a duration)
> * Track metastore operation time
> * Track IO decoding time for row-based input sources; need to make sure the overhead is low
>
> Shuffle:
> * Track read time and write time
> * Decide whether we can measure serialization and deserialization
>
> Client fetch time:
> * Sometimes a query takes long to run because it is blocked on the client fetching the result (e.g., using a result iterator). Record the time blocked on the client so we can exclude it when measuring query execution time.
>
> Make it easy to correlate queries with the jobs, stages, and tasks belonging to a single query, e.g., dump the execution id in task logs?
> Better logging:
> * Log the query execution id and TID in executor logs, and the query execution id in driver logs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
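The scan item above asks for start and end timestamps rather than a single duration, so that phases such as file listing and metastore calls can be laid out on a timeline. A minimal sketch of that idea, in generic Python (the `ScanMetrics` and `PhaseTiming` names are hypothetical illustrations, not Spark's actual metrics API):

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class PhaseTiming:
    """Start/end timestamps (epoch millis) for one phase, so a timeline
    can be reconstructed rather than only a duration."""
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


class ScanMetrics:
    """Collects timed phases such as file listing or metastore operations."""

    def __init__(self) -> None:
        self.phases: list[PhaseTiming] = []

    @contextmanager
    def timed(self, name: str):
        # Record both the start and the end timestamp, not just elapsed time,
        # so overlapping phases can be drawn on a shared timeline later.
        start = int(time.time() * 1000)
        try:
            yield
        finally:
            self.phases.append(PhaseTiming(name, start, int(time.time() * 1000)))
```

Usage would look like `with metrics.timed("file_listing"): list_files(...)`; the same context manager could wrap metastore calls, keeping the instrumentation overhead to two clock reads per phase.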
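The client-fetch-time item can likewise be sketched as an iterator wrapper that accumulates the time spent handing rows to the consumer, so that time can be subtracted from wall-clock query time. This is a generic Python illustration of the idea only; `ClientFetchTimer` is a hypothetical name, not how Spark implements it:

```python
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


class ClientFetchTimer:
    """Wraps a result iterator and accumulates the time spent inside
    next(), i.e. time attributable to the client-side fetch loop."""

    def __init__(self, results: Iterable[T]) -> None:
        self._it = iter(results)
        self.blocked_seconds = 0.0  # total time spent pulling results

    def __iter__(self) -> Iterator[T]:
        return self

    def __next__(self) -> T:
        start = time.monotonic()
        try:
            return next(self._it)
        finally:
            # Accumulate even when the underlying iterator raises
            # StopIteration at the end of the result set.
            self.blocked_seconds += time.monotonic() - start
```

Reporting `total_wall_clock - blocked_seconds` would then approximate the execution time with client fetch stalls removed, which is the correction the ticket asks for.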