[jira] [Comment Edited] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
[ https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848824#comment-17848824 ]

Eric Yang edited comment on SPARK-48397 at 5/23/24 6:38 AM:
The PR: https://github.com/apache/spark/pull/46714

was (Author: JIRAUSER304132):
I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
>
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
> Labels: pull-request-available
>
> For FileFormatDataWriter we currently record the "task commit time" and "job commit time" metrics in `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
> We could also record the time spent on the "data write" itself (together with the time spent producing records from the iterator), which is usually one of the major parts of the total duration of a write operation. This would help us identify bottlenecks and time skew, and would also aid general performance tuning.
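To make the proposal above concrete, here is a minimal, hypothetical sketch of how a "data write time" value could be accumulated around the write path. The class and field names (TimedWrite, writeTimeNs) are invented for illustration and are not the implementation in the linked PR; Spark's real trackers would publish such a value through a SQLMetric (e.g. one created with SQLMetrics.createTimingMetric) rather than a plain field.

{code:scala}
// Hedged sketch: accumulate the time spent handing rows to the output writer.
import org.apache.spark.sql.catalyst.InternalRow

final class TimedWrite(writeRow: InternalRow => Unit) {
  private var writeTimeNs: Long = 0L

  def write(row: InternalRow): Unit = {
    val start = System.nanoTime()
    writeRow(row) // the actual file-format writer consumes the row here
    writeTimeNs += System.nanoTime() - start
  }

  // Reported at task end; Spark displays timing metrics in milliseconds.
  def writeTimeMs: Long = writeTimeNs / 1000000L
}
{code}

A stats tracker would then publish this value alongside the existing "task commit time" and "job commit time" entries, making slow or skewed writers visible per task.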
[jira] [Commented] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
[ https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848824#comment-17848824 ]

Eric Yang commented on SPARK-48397:
---

I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
>
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
>
> For FileFormatDataWriter we currently record the "task commit time" and "job commit time" metrics in `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
> We could also record the time spent on the "data write" itself (together with the time spent producing records from the iterator), which is usually one of the major parts of the total duration of a write operation. This would help us identify bottlenecks and time skew, and would also aid general performance tuning.
[jira] [Created] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
Eric Yang created SPARK-48397:
-

Summary: Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
Key: SPARK-48397
URL: https://issues.apache.org/jira/browse/SPARK-48397
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Eric Yang

For FileFormatDataWriter we currently record the "task commit time" and "job commit time" metrics in `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`. We could also record the time spent on the "data write" itself (together with the time spent producing records from the iterator), which is usually one of the major parts of the total duration of a write operation. This would help us identify bottlenecks and time skew, and would also aid general performance tuning.
[jira] [Comment Edited] (SPARK-48298) Add TCP mode to StatsdSink
[ https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846789#comment-17846789 ]

Eric Yang edited comment on SPARK-48298 at 5/16/24 4:48 AM:
PR: https://github.com/apache/spark/pull/46604

was (Author: JIRAUSER304132):
I'm preparing a PR for it.

> Add TCP mode to StatsdSink
> --
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
> Labels: pull-request-available
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is StatsD's default mode. However, in real production environments we often find that more reliable transmission of metrics is needed to avoid metric loss in high-traffic systems.
>
> TCP mode is already supported by StatsD itself: [https://github.com/statsd/statsd/blob/master/docs/server.md], by Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter], and by many other StatsD-based metrics proxies/receivers.
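For readers unfamiliar with the distinction: UDP StatsD is fire-and-forget (datagrams are silently dropped under load), while a TCP mode keeps a connected stream, so delivery failures surface as socket errors. Below is a minimal, hypothetical sketch of a TCP write path. It is not the code in the PR; a production sink would also need reconnection and error handling, and the gauge helper here is an illustrative assumption.

{code:scala}
// Hedged sketch: newline-delimited StatsD lines over a TCP stream.
import java.io.PrintWriter
import java.net.Socket

final class TcpStatsdWriter(host: String, port: Int) extends AutoCloseable {
  private val socket = new Socket(host, port)
  private val out = new PrintWriter(socket.getOutputStream, true) // autoflush each line

  // The StatsD wire format is unchanged in TCP mode; lines end with '\n'.
  def gauge(name: String, value: Long): Unit = out.println(s"$name:$value|g")

  override def close(): Unit = {
    out.close()
    socket.close() // unlike UDP, writes and close can surface delivery problems
  }
}
{code}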
[jira] [Updated] (SPARK-48298) Add TCP mode to StatsdSink
[ https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated SPARK-48298:
--
Summary: Add TCP mode to StatsdSink (was: StatsdSink supports TCP mode)

> Add TCP mode to StatsdSink
> --
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is StatsD's default mode. However, in real production environments we often find that more reliable transmission of metrics is needed to avoid metric loss in high-traffic systems.
>
> TCP mode is already supported by StatsD itself: [https://github.com/statsd/statsd/blob/master/docs/server.md], by Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter], and by many other StatsD-based metrics proxies/receivers.
[jira] [Updated] (SPARK-48298) StatsdSink supports TCP mode
[ https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated SPARK-48298:
--
Description:
Currently, the StatsdSink in Spark supports UDP mode only, which is StatsD's default mode. However, in real production environments we often find that more reliable transmission of metrics is needed to avoid metric loss in high-traffic systems.

TCP mode is already supported by StatsD itself: [https://github.com/statsd/statsd/blob/master/docs/server.md], by Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter], and by many other StatsD-based metrics proxies/receivers.

was:
Currently, the StatsdSink in Spark supports UDP mode only, which is the default mode of StatsD. However, in real production environments, we often find that a more reliable transmission of metrics is needed to avoid metrics lose in high-traffic systems.

TCP mode is already supported by Statsd: [https://github.com/statsd/statsd/blob/master/docs/server.md]
Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter]
and also many other Statsd-based metrics proxy/receiver.

> StatsdSink supports TCP mode
>
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is StatsD's default mode. However, in real production environments we often find that more reliable transmission of metrics is needed to avoid metric loss in high-traffic systems.
>
> TCP mode is already supported by StatsD itself: [https://github.com/statsd/statsd/blob/master/docs/server.md], by Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter], and by many other StatsD-based metrics proxies/receivers.
[jira] [Created] (SPARK-48298) StatsdSink supports TCP mode
Eric Yang created SPARK-48298:
-

Summary: StatsdSink supports TCP mode
Key: SPARK-48298
URL: https://issues.apache.org/jira/browse/SPARK-48298
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Eric Yang

Currently, the StatsdSink in Spark supports UDP mode only, which is StatsD's default mode. However, in real production environments we often find that more reliable transmission of metrics is needed to avoid metric loss in high-traffic systems.

TCP mode is already supported by StatsD: [https://github.com/statsd/statsd/blob/master/docs/server.md]
Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter]
and also many other StatsD-based metrics proxy/receiver.
[jira] [Commented] (SPARK-48298) StatsdSink supports TCP mode
[ https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846789#comment-17846789 ]

Eric Yang commented on SPARK-48298:
---

I'm preparing a PR for it.

> StatsdSink supports TCP mode
>
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Eric Yang
> Priority: Major
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is StatsD's default mode. However, in real production environments we often find that more reliable transmission of metrics is needed to avoid metric loss in high-traffic systems.
>
> TCP mode is already supported by StatsD: [https://github.com/statsd/statsd/blob/master/docs/server.md]
> Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter]
> and also many other StatsD-based metrics proxy/receiver.
[jira] [Commented] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844145#comment-17844145 ]

Eric Yang commented on SPARK-47017:
---

I'm preparing a PR for it.

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, simple2.scala
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
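The reporting split described above can be observed directly from the listener bus. The rough, illustrative sketch below only prints what it sees and is not part of any proposed fix: the metric definitions arrive with the SQL start event of the query that built the RDD, while the values arrive as accumulator updates at task end under a different execution.

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent, SparkListenerTaskEnd}
import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart

class MetricSplitListener extends SparkListener {
  // Metric *definitions* ride on the SQL start event of the "previous SQL",
  // together with the physical plan description.
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionStart =>
      println(s"execution ${e.executionId} plan:\n${e.physicalPlanDescription}")
    case _ => // ignore everything else
  }

  // Metric *values* arrive as accumulator updates when tasks end,
  // attributed to whichever SQL execution actually ran the tasks.
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.taskInfo.accumulables.foreach { acc =>
      println(s"accum ${acc.id} ${acc.name.getOrElse("?")} += ${acc.update.getOrElse("")}")
    }
  }
}

// Assumed usage: spark.sparkContext.addSparkListener(new MetricSplitListener)
{code}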
[jira] [Comment Edited] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817786#comment-17817786 ]

Eric Yang edited comment on SPARK-47017 at 2/15/24 9:30 PM:
Here is a simple example of this issue (based on the example code under the package 'org.apache.spark.examples.sql'): [^simple2.scala]
The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a Dataset from an existing RDD "resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is later converted to an RDDScanExec, and its internal RDD has a filter (age > 20). The SQL metrics of this filter are not shown anywhere, so we have no idea what the internal RDD execution looks like in this case (imagine that, instead of a simple filter, the RDD may contain very complex logic with many physical nodes).

A possible solution is to follow what InMemoryRelation does: it keeps the original physical plan, so we would still have a chance to show the DAG and the metric values somewhere.

was (Author: JIRAUSER304132):
Here is a simple example of this issue (based on the example code under the package 'org.apache.spark.examples.sql'): [^simple2.scala]
The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a dataset from an existing RDD "resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is converted to an RDDScanExec later and its internal RDD has a filter (age > 20). The SQL metrics of this filter are not shown anywhere so we have no idea what the internal RDD execution looks like in this case (imagine that, instead of a simple filter, the RDD may contain very complex logic with many physical nodes.)

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, simple2.scala
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
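Since [^simple2.scala] is only available as a JIRA attachment, here is a hedged approximation of the shape of the repro (all names are invented; the attached file and its L265 may differ):

{code:scala}
// Approximate repro: a Dataset pipeline is materialized to an RDD, then
// wrapped back into a DataFrame, which plans as "Scan ExistingRDD".
import org.apache.spark.sql.SparkSession

object Simple2Like {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("simple2-like").getOrCreate()
    import spark.implicits._

    val people = Seq(("alice", 30), ("bob", 15)).toDF("name", "age")

    // The filter (age > 20) is baked into this RDD's lineage...
    val resultsRDD = people.filter($"age" > 20).rdd

    // ...so the new DataFrame only sees an opaque existing RDD (LogicalRDD);
    // the filter's SQL metrics are no longer attached to any visible plan node.
    val df2 = spark.createDataFrame(resultsRDD, people.schema)
    df2.show() // the SQL tab shows "Scan ExistingRDD" with no inner metrics

    spark.stop()
  }
}
{code}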
[jira] [Comment Edited] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817786#comment-17817786 ]

Eric Yang edited comment on SPARK-47017 at 2/15/24 9:27 PM:
Here is a simple example of this issue (based on the example code under the package 'org.apache.spark.examples.sql'): [^simple2.scala]
The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a dataset from an existing RDD "resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is converted to an RDDScanExec later and its internal RDD has a filter (age > 20). The SQL metrics of this filter are not shown anywhere so we have no idea what the internal RDD execution looks like in this case (imagine that, instead of a simple filter, the RDD may contain very complex logic with many physical nodes.)

was (Author: JIRAUSER304132):
Here is a simple example of this issue (based on the example code under the package 'org.apache.spark.examples.sql'): [^simple2.scala][^simple2.scala][^simple2.scala]
[^eventLogs-local-1708032228180.zip]

In L265 we create a dataset from an existing RDD "resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is converted to an RDDScanExec later and its internal RDD has a filter (age > 20). The SQL metrics of this filter are not shown anywhere so we have no idea what the internal RDD execution looks like in this case (imagine that, instead of a simple filter, the RDD may contain very complex logic with many physical nodes.)

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, simple2.scala
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
[jira] [Commented] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817786#comment-17817786 ]

Eric Yang commented on SPARK-47017:
---

Here is a simple example of this issue (based on the example code under the package 'org.apache.spark.examples.sql'): [^simple2.scala][^simple2.scala][^simple2.scala]
[^eventLogs-local-1708032228180.zip]

In L265 we create a dataset from an existing RDD "resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is converted to an RDDScanExec later and its internal RDD has a filter (age > 20). The SQL metrics of this filter are not shown anywhere so we have no idea what the internal RDD execution looks like in this case (imagine that, instead of a simple filter, the RDD may contain very complex logic with many physical nodes.)

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, simple2.scala
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated SPARK-47017:
--
Attachment: eventLogs-local-1708032228180.zip

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, simple2.scala
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated SPARK-47017:
--
Attachment: simple2.scala

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg, simple2.scala
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated SPARK-47017:
--
Attachment: ScanExistingRDD.jpg

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !image-2024-02-09-09-34-33-442.png|width=380,height=345!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
[ https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated SPARK-47017:
--
Description:
RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.

{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow],   <-- this field
    name: String,
{code}

However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
!ScanExistingRDD.jpg|width=336,height=296!

Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.

was:
The RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by some very large physical plans which contain quite a few physical nodes. Those nodes may have various metrics which are very useful for us to know what the execution looks like and any room for optimization, etc.

{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow],   <-- this field
    name: String,
{code}

However, the physical plan and the metrics are invisible from the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found from some previous SQL. The metrics are not visible from that previous SQL either. This is because the "definition" of these metrics are reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric values are reported from the SparkListenerTaskEnd event of the tasks which are attached to the SQL with RDDScanExec.
!image-2024-02-09-09-34-33-442.png|width=380,height=345!

Do we consider showing the physical plan and metrics of the RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG). For example, it may be shown as a "leg" (similar to but not the same as a child) in the DAG, or something else that may show the physical plan and metrics?

> Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg
>
> RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.
>
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow],   <-- this field
>     name: String,
> {code}
>
> However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>
> Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
[jira] [Created] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
Eric Yang created SPARK-47017:
-

Summary: Show metrics of the physical plan of RDDScanExec's internal RDD in the history server
Key: SPARK-47017
URL: https://issues.apache.org/jira/browse/SPARK-47017
Project: Spark
Issue Type: New Feature
Components: Web UI
Affects Versions: 3.5.0, 3.4.0
Reporter: Eric Yang

RDDScanExec wraps an internal RDD (as below). In our environment, we find that this RDD is usually produced by very large physical plans containing quite a few physical nodes. Those nodes may have various metrics that are very useful for understanding what the execution looks like, where there is room for optimization, etc.

{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow],   <-- this field
    name: String,
{code}

However, the physical plan and the metrics are invisible in the SQL DAG in the Spark History Server. As it is an "existing RDD", the physical plan may be found in some previous SQL, but the metrics are not visible in that previous SQL either. This is because the "definition" of these metrics is reported along with the SparkListenerSQLExecutionStart event of the "previous SQL" (which contains the physical plan of RDDScanExec.rdd), while the metric values are reported via the SparkListenerTaskEnd events of the tasks attached to the SQL containing the RDDScanExec.
!image-2024-02-09-09-34-33-442.png|width=380,height=345!

Could we consider showing the physical plan and metrics of RDDScanExec.rdd (the "Scan Existing RDD" node in the above DAG)? For example, it might be shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in some other form that exposes the physical plan and metrics.
[jira] [Comment Edited] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100 ]

Eric Yang edited comment on SPARK-23717 at 9/28/18 4:33 PM:
It is possible to run standalone Spark in YARN Docker containers without any code modifications to Spark. Here is an example yarnfile that I used to run the Mesosphere-generated Docker image, and it ran fine:

{code}
{
  "name": "spark",
  "kerberos_principal" : {
    "principal_name" : "spark/_h...@example.com",
    "keytab" : "file:///etc/security/keytabs/spark.service.keytab"
  },
  "version": "0.1",
  "components" : [
    {
      "name": "driver",
      "number_of_containers": 1,
      "artifact": {
        "id": "mesosphere/spark:latest",
        "type": "DOCKER"
      },
      "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": true,
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
          "SPARK_NO_DAEMONIZE":"true",
          "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
        },
        "properties": {
          "docker.network": "host"
        }
      }
    },
    {
      "name": "executor",
      "number_of_containers": 2,
      "artifact": {
        "id": "mesosphere/spark:latest",
        "type": "DOCKER"
      },
      "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": true,
      "dependencies": [ "driver" ],
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
          "SPARK_NO_DAEMONIZE":"true",
          "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
        },
        "properties": {
          "docker.network": "host"
        }
      }
    }
  ]
}
{code}

The reason for the 30-second sleep is to ensure RegistryDNS has been refreshed and updated to respond to DNS queries. The sleep could be a lot shorter, like 3 seconds; I did not spend much time trying to fine-tune the DNS wait time. A further enhancement to pass in a keytab and krb5.conf could enable access to secure HDFS; that is left as an exercise for the readers of this JIRA.

was (Author: eyang):
It is possible to run standalone Spark in YARN without any code modification to spark. Here is an example yarnfile that I used to run mesosphere generated docker image and it ran fine:

{code}
{
  "name": "spark",
  "kerberos_principal" : {
    "principal_name" : "spark/_h...@example.com",
    "keytab" : "file:///etc/security/keytabs/spark.service.keytab"
  },
  "version": "0.1",
  "components" : [
    {
      "name": "driver",
      "number_of_containers": 1,
      "artifact": {
        "id": "mesosphere/spark:latest",
        "type": "DOCKER"
      },
      "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": true,
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
          "SPARK_NO_DAEMONIZE":"true",
          "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
        },
        "properties": {
          "docker.network": "host"
        }
      }
    },
    {
      "name": "executor",
      "number_of_containers": 2,
      "artifact": {
        "id": "mesosphere/spark:latest",
        "type": "DOCKER"
      },
      "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": true,
      "dependencies": [ "driver" ],
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
          "SPARK_NO_DAEMONIZE":"true",
          "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
        },
        "properties": {
          "docker.network": "host"
        }
      }
    }
  ]
}
{code}

The reason for 30 seconds sleep is to ensure RegistryDNS has been refreshed and updated to respond to DNS queries.
The sleep could be a lot shorter like 3 seconds. I did not spend much time to try to fine tune the DNS wait time. Further enhancement to pass in keytab and krb5.conf can enable access to secure HDFS, that would be exercise for the readers of this JIRA.

> Leverage docker support in Hadoop 3
> ---
>
> Key: SPARK-23717
> URL: https://issues.apache.org/jira/browse/SPARK-23717
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, YARN
> Affects Versions: 2.4.0
> Reporter: Mridul Muralidharan
>
[jira] [Commented] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100 ]

Eric Yang commented on SPARK-23717:
---

It is possible to run standalone Spark in YARN without any code modifications to Spark. Here is an example yarnfile that I used to run the Mesosphere-generated Docker image, and it ran fine:

{code}
{
  "name": "spark",
  "kerberos_principal" : {
    "principal_name" : "spark/_h...@example.com",
    "keytab" : "file:///etc/security/keytabs/spark.service.keytab"
  },
  "version": "0.1",
  "components" : [
    {
      "name": "driver",
      "number_of_containers": 1,
      "artifact": {
        "id": "mesosphere/spark:latest",
        "type": "DOCKER"
      },
      "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": true,
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
          "SPARK_NO_DAEMONIZE":"true",
          "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
        },
        "properties": {
          "docker.network": "host"
        }
      }
    },
    {
      "name": "executor",
      "number_of_containers": 2,
      "artifact": {
        "id": "mesosphere/spark:latest",
        "type": "DOCKER"
      },
      "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": true,
      "dependencies": [ "driver" ],
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
          "SPARK_NO_DAEMONIZE":"true",
          "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
        },
        "properties": {
          "docker.network": "host"
        }
      }
    }
  ]
}
{code}

The reason for the 30-second sleep is to ensure RegistryDNS has been refreshed and updated to respond to DNS queries. The sleep could be a lot shorter, like 3 seconds; I did not spend much time trying to fine-tune the DNS wait time. A further enhancement to pass in a keytab and krb5.conf could enable access to secure HDFS; that is left as an exercise for the readers of this JIRA.

> Leverage docker support in Hadoop 3
> ---
>
> Key: SPARK-23717
> URL: https://issues.apache.org/jira/browse/SPARK-23717
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, YARN
> Affects Versions: 2.4.0
> Reporter: Mridul Muralidharan
> Priority: Major
>
> The introduction of docker support in Apache Hadoop 3 can be leveraged by Apache Spark for resolving multiple long-standing shortcomings - particularly related to package isolation.
> It also allows for network isolation, where applicable, allowing for more sophisticated cluster configuration/customization.
> This JIRA will track the various tasks for enhancing Spark to leverage container support.
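For context, a yarnfile like the one above would be submitted through the Hadoop 3 YARN services CLI. A hedged usage example follows; the local file name spark.json is an assumption, not taken from the comment:

{code}
# Launch the service described by the yarnfile (file name is illustrative).
yarn app -launch spark spark.json

# Tear the service down when finished.
yarn app -destroy spark
{code}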
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606052#comment-16606052 ]

Eric Yang commented on SPARK-25330:
---

{quote}
user.getRealUser(): ad...@kerberos.mycom.com (auth:KERBEROS)
user.getRealUser().isFromKeytab(): false
user.getRealUser().hasKerberosCredentials(): false
{quote}
If I am reading this correctly, the real user must either be from a keytab or have Kerberos credentials. Both cannot be false; otherwise it would be a breach of Kerberos security, since the real user would not have been authorized by the KDC. [~daryn] [~jlowe] thoughts?

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql
> {code}
>
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user_b, access=EXECUTE, inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
> {noformat}
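For readers following the UGI discussion: HADOOP_PROXY_USER makes the client wrap a real (authenticated) user inside a proxy-user UGI, and the two flags quoted above are queried on that real user. A hedged illustration using the standard Hadoop UserGroupInformation API is below; the surrounding Kerberos setup and the proxied user name are assumptions.

{code:scala}
// Illustrative only: inspecting the real user behind a proxy UGI,
// mirroring the fields quoted in the comment above.
import org.apache.hadoop.security.UserGroupInformation

object ProxyUserProbe {
  def main(args: Array[String]): Unit = {
    // Assumes a Kerberos login has already happened (kinit or a keytab login).
    val real = UserGroupInformation.getLoginUser
    val proxy = UserGroupInformation.createProxyUser("user_b", real)

    val realUser = proxy.getRealUser
    println(s"realUser: $realUser")
    println(s"isFromKeytab: ${realUser.isFromKeytab}")
    println(s"hasKerberosCredentials: ${realUser.hasKerberosCredentials}")
  }
}
{code}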
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605114#comment-16605114 ]

Eric Yang commented on SPARK-25330:
---

[~yumwang] Does Hadoop 2.7.5 work? That might help us pinpoint the release that introduced the regression and narrow down the JIRAs the Hadoop team needs to go through. Thanks

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql
> {code}
>
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user_b, access=EXECUTE, inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
> {noformat}