Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-06-06 Thread via GitHub
snmvaughan commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2152813241
@cloud-fan Did you still have concerns about collecting and reporting the stats per partition?

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-05-21 Thread via GitHub
snmvaughan commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2123350238
@cloud-fan Spark already collects information about the number of rows and bytes written, but only reports the total aggregate. If you're concerned about the overall size, it is
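For illustration, a minimal standalone Scala sketch of the distinction being discussed here: the same per-file numbers Spark already collects, reported once as a job-wide aggregate versus broken down by partition. The FileStat type and the partition strings are hypothetical, not part of the PR.

    object AggregateVsPerPartition {
      // Hypothetical per-file record; Spark tracks similar numbers for each written file.
      final case class FileStat(partition: String, bytes: Long, rows: Long)

      def main(args: Array[String]): Unit = {
        val written = Seq(
          FileStat("ds=2024-04-23", bytes = 1024L, rows = 10L),
          FileStat("ds=2024-04-23", bytes = 2048L, rows = 20L),
          FileStat("ds=2024-04-24", bytes = 512L, rows = 5L))

        // What is reported today: one aggregate across all partitions.
        println(s"total: bytes=${written.map(_.bytes).sum} rows=${written.map(_.rows).sum}")

        // What the PR aims to expose as well: the same numbers per partition.
        written.groupBy(_.partition).foreach { case (p, stats) =>
          println(s"$p: bytes=${stats.map(_.bytes).sum} rows=${stats.map(_.rows).sum}")
        }
      }
    }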

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-29 Thread via GitHub
snmvaughan commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2084197846
We're looking to collect deeper insights into what jobs are doing, beyond the current read/write statistics such as bytes, num files, etc.

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-29 Thread via GitHub
snmvaughan commented on code in PR #46188: URL: https://github.com/apache/spark/pull/46188#discussion_r1584036371
File: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
Diff hunk: @@ -223,6 +278,9 @@ class BasicWriteJobStatsTracker(

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-29 Thread via GitHub
snmvaughan commented on code in PR #46188: URL: https://github.com/apache/spark/pull/46188#discussion_r1584033823
File: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
Diff hunk: @@ -43,10 +44,18 @@ case class BasicWriteTaskStats(
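The hunk above touches the BasicWriteTaskStats case class. As a rough, standalone sketch (not the PR's actual fields), a partition-aware task stats container could keep the existing task-level aggregates and add a per-partition breakdown; Spark's real class uses internal types such as InternalRow, which are replaced with String here so the snippet compiles on its own:

    // Hypothetical names; only numFiles/numBytes/numRows mirror the existing fields.
    final case class PartitionWriteStats(numFiles: Int, numBytes: Long, numRows: Long)

    final case class TaskWriteStatsSketch(
        // one entry per dynamic partition the task wrote to, keyed by partition path
        partitionStats: Map[String, PartitionWriteStats],
        // the existing task-level aggregates, unchanged
        numFiles: Int,
        numBytes: Long,
        numRows: Long)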

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-28 Thread via GitHub
cloud-fan commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2081399740
is the end goal to automatically update table statistics?

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-28 Thread via GitHub
cloud-fan commented on code in PR #46188: URL: https://github.com/apache/spark/pull/46188#discussion_r1582063564
File: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
Diff hunk: @@ -223,6 +278,9 @@ class BasicWriteJobStatsTracker(
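The hunk under discussion sits in BasicWriteJobStatsTracker, which aggregates per-task stats on the driver. A hedged sketch of what an extra per-partition merge step could look like, reusing the hypothetical PartitionWriteStats shape from the earlier sketch (an illustration of the idea, not the PR's code):

    object DriverSideMergeSketch {
      final case class PartitionWriteStats(numFiles: Int, numBytes: Long, numRows: Long) {
        def merge(other: PartitionWriteStats): PartitionWriteStats =
          PartitionWriteStats(numFiles + other.numFiles, numBytes + other.numBytes, numRows + other.numRows)
      }

      // Fold the per-partition maps reported by each write task into job-level totals per partition.
      def mergeTaskStats(perTask: Seq[Map[String, PartitionWriteStats]]): Map[String, PartitionWriteStats] =
        perTask.foldLeft(Map.empty[String, PartitionWriteStats]) { (acc, taskMap) =>
          taskMap.foldLeft(acc) { case (m, (partition, stats)) =>
            m.updated(partition, m.get(partition).map(_.merge(stats)).getOrElse(stats))
          }
        }
    }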

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-26 Thread via GitHub
dbtsai commented on code in PR #46188: URL: https://github.com/apache/spark/pull/46188#discussion_r1581313246
File: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
Diff hunk: @@ -213,6 +260,14 @@ class BasicWriteJobStatsTracker(

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-26 Thread via GitHub
dbtsai commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2079779162
Gently pinging @cloud-fan

Re: [PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-23 Thread via GitHub
snmvaughan commented on PR #46188: URL: https://github.com/apache/spark/pull/46188#issuecomment-2072876115
cc @cloud-fan

[PR] [SPARK-47050][SQL] Collect and publish partition level metrics for V1 [spark]

2024-04-23 Thread via GitHub
snmvaughan opened a new pull request, #46188: URL: https://github.com/apache/spark/pull/46188
We currently capture metrics which include the number of files, bytes and rows for a task along with the updated partitions. This change captures metrics for each updated partition, reporting
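For readers skimming the thread, a simplified, standalone sketch of the idea: a task-side tracker keyed by the partition currently being written. The callback names loosely mirror Spark's WriteTaskStatsTracker (newPartition / newFile / newRow), but the signatures are simplified here (String instead of InternalRow) and the bookkeeping is illustrative, not the PR's implementation.

    import scala.collection.mutable

    final class PerPartitionStatsSketch {
      private var currentPartition: Option[String] = None
      private val files = mutable.Map.empty[String, Int].withDefaultValue(0)
      private val rows = mutable.Map.empty[String, Long].withDefaultValue(0L)

      // Called when the writer switches to a new dynamic partition.
      def newPartition(partitionPath: String): Unit = currentPartition = Some(partitionPath)

      // Called per file opened and per row written; counters are keyed by the current partition.
      def newFile(filePath: String): Unit = currentPartition.foreach(p => files(p) += 1)
      def newRow(filePath: String): Unit = currentPartition.foreach(p => rows(p) += 1L)

      // Snapshot shipped back to the driver at task commit: partition -> (numFiles, numRows).
      def finalStats: Map[String, (Int, Long)] =
        (files.keySet ++ rows.keySet).iterator.map(p => p -> (files(p), rows(p))).toMap
    }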