[jira] [Comment Edited] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-23 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848824#comment-17848824
 ] 

Eric Yang edited comment on SPARK-48397 at 5/23/24 6:38 AM:


The PR: https://github.com/apache/spark/pull/46714


was (Author: JIRAUSER304132):
I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>  Labels: pull-request-available
>
> For FileFormatDataWriter we currently record metrics of "task commit time" 
> and "job commit time" in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
>  We may also record the time spent on "data write" (together with the time 
> spent on producing records from the iterator), which is usually one of the 
> major parts of the total duration of a write operation. This would help us 
> identify bottlenecks and time skew, and would also be useful for general 
> performance tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-23 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848824#comment-17848824
 ] 

Eric Yang commented on SPARK-48397:
---

I'm working on a PR for it.

> Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker
> 
>
> Key: SPARK-48397
> URL: https://issues.apache.org/jira/browse/SPARK-48397
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>
> For FileFormatDataWriter we currently record metrics of "task commit time" 
> and "job commit time" in 
> `org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`.
>  We may also record the time spent on "data write" (together with the time 
> spent on producing records from the iterator), which is usually one of the 
> major parts of the total duration of a write operation. This would help us 
> identify bottlenecks and time skew, and would also be useful for general 
> performance tuning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48397) Add data write time metric to FileFormatDataWriter/BasicWriteJobStatsTracker

2024-05-23 Thread Eric Yang (Jira)
Eric Yang created SPARK-48397:
-

 Summary: Add data write time metric to 
FileFormatDataWriter/BasicWriteJobStatsTracker
 Key: SPARK-48397
 URL: https://issues.apache.org/jira/browse/SPARK-48397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Eric Yang


For FileFormatDataWriter we currently record metrics of "task commit time" and 
"job commit time" in 
`org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker#metrics`. 
We may also record the time spent on "data write" (together with the time spent 
on producing records from the iterator), which is usually one of the major 
parts of the total duration of a write operation. This would help us identify 
bottlenecks and time skew, and would also be useful for general performance 
tuning.
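
As a rough illustration of the kind of change proposed (a minimal sketch, not 
the actual patch; the class and field names below are hypothetical), the 
per-row write path could be wrapped with a nanosecond timer whose total is 
later published as a SQL metric next to "task commit time" and "job commit time":

{code:scala}
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sketch: accumulate the time spent handing rows to the underlying
// file-format writer, so a stats tracker can report it as "data write time".
class TimedRowWriter(writeRow: InternalRow => Unit) {
  private var writeTimeNs: Long = 0L

  def write(row: InternalRow): Unit = {
    val start = System.nanoTime()
    writeRow(row)                                  // the actual data write
    writeTimeNs += System.nanoTime() - start
  }

  // Value a tracker could expose as the new metric, in milliseconds.
  def writeTimeMs: Long = TimeUnit.NANOSECONDS.toMillis(writeTimeNs)
}
{code}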



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48298) Add TCP mode to StatsdSink

2024-05-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846789#comment-17846789
 ] 

Eric Yang edited comment on SPARK-48298 at 5/16/24 4:48 AM:


PR: https://github.com/apache/spark/pull/46604


was (Author: JIRAUSER304132):
I'm preparing a PR for it.

> Add TCP mode to StatsdSink
> --
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is the 
> default mode of StatsD. However, in real production environments, we often 
> find that more reliable transmission of metrics is needed to avoid metrics 
> loss in high-traffic systems.
>  
> TCP mode is already supported by StatsD 
> ([https://github.com/statsd/statsd/blob/master/docs/server.md]), by 
> Prometheus' statsd_exporter ([https://github.com/prometheus/statsd_exporter]), 
> and by many other StatsD-based metrics proxies/receivers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48298) Add TCP mode to StatsdSink

2024-05-15 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated SPARK-48298:
--
Summary: Add TCP mode to StatsdSink  (was: StatsdSink supports TCP mode)

> Add TCP mode to StatsdSink
> --
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is the 
> default mode of StatsD. However, in real production environments, we often 
> find that more reliable transmission of metrics is needed to avoid metrics 
> loss in high-traffic systems.
>  
> TCP mode is already supported by StatsD 
> ([https://github.com/statsd/statsd/blob/master/docs/server.md]), by 
> Prometheus' statsd_exporter ([https://github.com/prometheus/statsd_exporter]), 
> and by many other StatsD-based metrics proxies/receivers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48298) StatsdSink supports TCP mode

2024-05-15 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated SPARK-48298:
--
Description: 
Currently, the StatsdSink in Spark supports UDP mode only, which is the default 
mode of StatsD. However, in real production environments, we often find that 
more reliable transmission of metrics is needed to avoid metrics loss in 
high-traffic systems.

 

TCP mode is already supported by StatsD 
([https://github.com/statsd/statsd/blob/master/docs/server.md]), by 
Prometheus' statsd_exporter ([https://github.com/prometheus/statsd_exporter]), 
and by many other StatsD-based metrics proxies/receivers.

  was:
Currently, the StatsdSink in Spark supports UDP mode only, which is the default 
mode of StatsD. However, in real production environments, we often find that a 
more reliable transmission of metrics is needed to avoid metrics lose in 
high-traffic systems.

 

TCP mode is already supported by Statsd: 
[https://github.com/statsd/statsd/blob/master/docs/server.md]

Prometheus' statsd_exporter: [https://github.com/prometheus/statsd_exporter] 

and also many other Statsd-based metrics proxy/receiver.


> StatsdSink supports TCP mode
> 
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is the 
> default mode of StatsD. However, in real production environments, we often 
> find that more reliable transmission of metrics is needed to avoid metrics 
> loss in high-traffic systems.
>  
> TCP mode is already supported by StatsD 
> ([https://github.com/statsd/statsd/blob/master/docs/server.md]), by 
> Prometheus' statsd_exporter ([https://github.com/prometheus/statsd_exporter]), 
> and by many other StatsD-based metrics proxies/receivers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48298) StatsdSink supports TCP mode

2024-05-15 Thread Eric Yang (Jira)
Eric Yang created SPARK-48298:
-

 Summary: StatsdSink supports TCP mode
 Key: SPARK-48298
 URL: https://issues.apache.org/jira/browse/SPARK-48298
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Eric Yang


Currently, the StatsdSink in Spark supports UDP mode only, which is the default 
mode of StatsD. However, in real production environments, we often find that 
more reliable transmission of metrics is needed to avoid metrics loss in 
high-traffic systems.

 

TCP mode is already supported by StatsD 
([https://github.com/statsd/statsd/blob/master/docs/server.md]), by 
Prometheus' statsd_exporter ([https://github.com/prometheus/statsd_exporter]), 
and by many other StatsD-based metrics proxies/receivers.
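
As an illustration only (not the proposed patch; the class name and the 
"protocol" option mentioned below are assumptions), a TCP sender differs from 
the existing UDP path mainly in that it keeps a connected socket and writes 
StatsD protocol lines to a reliable stream:

{code:scala}
import java.io.{BufferedWriter, OutputStreamWriter}
import java.net.Socket
import java.nio.charset.StandardCharsets

// Hypothetical sketch of a TCP StatsD sender. Unlike UDP datagrams, the lines
// go over a connected stream, so delivery is reliable but the socket (and any
// reconnect logic) must be managed by the sink.
class StatsdTcpSender(host: String, port: Int) extends AutoCloseable {
  private val socket = new Socket(host, port)
  private val out = new BufferedWriter(
    new OutputStreamWriter(socket.getOutputStream, StandardCharsets.UTF_8))

  // StatsD line protocol, e.g. "spark.driver.jvm.heap.used:123|g"
  def send(name: String, value: Long, metricType: String = "g"): Unit = {
    out.write(s"$name:$value|$metricType\n")
    out.flush()
  }

  override def close(): Unit = {
    out.close()
    socket.close()
  }
}
{code}

With something along these lines, UDP could stay the default and TCP could be 
selected through a sink property (for example, a hypothetical "protocol=tcp" 
option in the metrics configuration).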



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48298) StatsdSink supports TCP mode

2024-05-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846789#comment-17846789
 ] 

Eric Yang commented on SPARK-48298:
---

I'm preparing a PR for it.

> StatsdSink supports TCP mode
> 
>
> Key: SPARK-48298
> URL: https://issues.apache.org/jira/browse/SPARK-48298
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Eric Yang
>Priority: Major
>
> Currently, the StatsdSink in Spark supports UDP mode only, which is the 
> default mode of StatsD. However, in real production environments, we often 
> find that more reliable transmission of metrics is needed to avoid metrics 
> loss in high-traffic systems.
>  
> TCP mode is already supported by StatsD 
> ([https://github.com/statsd/statsd/blob/master/docs/server.md]), by 
> Prometheus' statsd_exporter ([https://github.com/prometheus/statsd_exporter]), 
> and by many other StatsD-based metrics proxies/receivers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-05-06 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844145#comment-17844145
 ] 

Eric Yang commented on SPARK-47017:
---

I'm preparing a PR for it. 

> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, 
> simple2.scala
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817786#comment-17817786
 ] 

Eric Yang edited comment on SPARK-47017 at 2/15/24 9:30 PM:


Here is a simple example of this issue (based on the example code under the 
package 'org.apache.spark.examples.sql'): [^simple2.scala]

The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a dataset from an existing RDD 
"resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is converted to 
an RDDScanExec later and its internal RDD has a filter (age > 20). The SQL 
metrics of this filter are not shown anywhere so we have no idea what the 
internal RDD execution looks like in this case (imagine that, instead of a 
simple filter, the RDD may contain very complex logic with many physical nodes.)

 

A possible solution is to follow what InMemoryRelation is doing: it keeps the 
original physical plan, so we still have a chance to show the DAG and the 
metric values somewhere.
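
For reference, a minimal way to reproduce the "Scan Existing RDD" situation 
described above looks roughly like this (a simplified stand-in for the attached 
simple2.scala, not the attachment itself):

{code:scala}
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object ScanExistingRddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ScanExistingRDD")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // First query: build a Dataset with a filter and expose it as an RDD.
    // The filter's SQL metrics belong to this query's physical plan.
    val resultsRDD = Seq(Person("a", 15), Person("b", 25), Person("c", 35))
      .toDS()
      .filter($"age" > 20)
      .rdd

    // Second query: create a DataFrame from the existing RDD. This produces a
    // LogicalRDD, which is planned as an RDDScanExec ("Scan Existing RDD" in
    // the DAG), and the filter's metrics are not visible from this plan.
    spark.createDataFrame(resultsRDD).count()

    spark.stop()
  }
}
{code}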


was (Author: JIRAUSER304132):
Here is a simple example of this issue (based on the example code under the 
package 'org.apache.spark.examples.sql'): [^simple2.scala]

The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a dataset from an existing RDD 
"resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is converted to 
an RDDScanExec later and its internal RDD has a filter (age > 20). The SQL 
metrics of this filter are not shown anywhere so we have no idea what the 
internal RDD execution looks like in this case (imagine that, instead of a 
simple filter, the RDD may contain very complex logic with many physical nodes.)

> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, 
> simple2.scala
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817786#comment-17817786
 ] 

Eric Yang edited comment on SPARK-47017 at 2/15/24 9:27 PM:


Here is a simple example of this issue (based on the example code under the 
package 'org.apache.spark.examples.sql'): [^simple2.scala]

The listener event logs: [^eventLogs-local-1708032228180.zip]

In L265 of the example code we create a dataset from an existing RDD 
"resultsRDD", which creates a "LogicalRDD". The LogicalRDD node is converted to 
an RDDScanExec later and its internal RDD has a filter (age > 20). The SQL 
metrics of this filter are not shown anywhere so we have no idea what the 
internal RDD execution looks like in this case (imagine that, instead of a 
simple filter, the RDD may contain very complex logic with many physical nodes.)


was (Author: JIRAUSER304132):
Here is a simple example of this issue (based on the example code under the 
package 'org.apache.spark.examples.sql'): 
[^simple2.scala][^simple2.scala][^simple2.scala] 
[^eventLogs-local-1708032228180.zip]

In L265 we create a dataset from an existing RDD "resultsRDD", which creates a 
"LogicalRDD". The LogicalRDD node is converted to an RDDScanExec later and its 
internal RDD has a filter (age > 20). The SQL metrics of this filter are not 
shown anywhere so we have no idea what the internal RDD execution looks like in 
this case (imagine that, instead of a simple filter, the RDD may contain very 
complex logic with many physical nodes.)

> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, 
> simple2.scala
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-15 Thread Eric Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817786#comment-17817786
 ] 

Eric Yang commented on SPARK-47017:
---

Here is a simple example of this issue (based on the example code under the 
package 'org.apache.spark.examples.sql'): 
[^simple2.scala] 
[^eventLogs-local-1708032228180.zip]

In L265 we create a dataset from an existing RDD "resultsRDD", which creates a 
"LogicalRDD". The LogicalRDD node is converted to an RDDScanExec later and its 
internal RDD has a filter (age > 20). The SQL metrics of this filter are not 
shown anywhere so we have no idea what the internal RDD execution looks like in 
this case (imagine that, instead of a simple filter, the RDD may contain very 
complex logic with many physical nodes.)

> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, 
> simple2.scala
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-15 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated SPARK-47017:
--
Attachment: eventLogs-local-1708032228180.zip

> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg, eventLogs-local-1708032228180.zip, 
> simple2.scala
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-15 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated SPARK-47017:
--
Attachment: simple2.scala

> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg, simple2.scala
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-09 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated SPARK-47017:
--
Attachment: ScanExistingRDD.jpg

> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !image-2024-02-09-09-34-33-442.png|width=380,height=345!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-09 Thread Eric Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated SPARK-47017:
--
Description: 
The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
that this RDD is usually produced by some very large physical plans which 
contain quite a few physical nodes. Those nodes may have various metrics which 
are very useful for understanding what the execution looks like and where 
there is room for optimization.

 
{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow], <-- this field
    name: String, {code}
 

However, the physical plan and the metrics are invisible from the SQL DAG in 
the Spark History Server. As it is an "existing RDD", the physical plan may be 
found from some previous SQL. The metrics are not visible from that previous 
SQL either. This is because the "definitions" of these metrics are reported 
along with the SparkListenerSQLExecutionStart event of the "previous SQL" 
(where the physical plan of the RDDScanExec.rdd is in), but the metric values 
are reported from the SparkListenerTaskEnd event of the tasks which are 
attached to the SQL with RDDScanExec.

!ScanExistingRDD.jpg|width=336,height=296!

 

Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
(the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
as a "leg" (similar to but not the same as a child) in the DAG, or something 
else that may show the physical plan and metrics?

 

  was:
The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
that this RDD is usually produced by some very large physical plans which 
contain quite a few physical nodes. Those nodes may have various metrics which 
are very useful for us to know what the execution looks like and any room for 
optimization, etc.

 
{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow], <-- this field
    name: String, {code}
 

However, the physical plan and the metrics are invisible from the SQL DAG in 
the Spark History Server. As it is an "existing RDD", the physical plan may be 
found from some previous SQL. The metrics are not visible from that previous 
SQL either. This is because the "definition" of these metrics are reported 
along with the SparkListenerSQLExecutionStart event of the "previous SQL" 
(where the physical plan of the RDDScanExec.rdd is in), but the metric values 
are reported from the SparkListenerTaskEnd event of the tasks which are 
attached to the SQL with RDDScanExec.

!image-2024-02-09-09-34-33-442.png|width=380,height=345!

 

Do we consider showing the physical plan and metrics of the RDDScanExec.rdd 
(the "Scan Existing RDD" node in the above DAG). For example, it may be shown 
as a "leg" (similar to but not the same as a child) in the DAG, or something 
else that may show the physical plan and metrics?

 


> Show metrics of the physical plan of RDDScanExec's internal RDD in the 
> history server
> -
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: ScanExistingRDD.jpg
>
>
> The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
> that this RDD is usually produced by some very large physical plans which 
> contain quite a few physical nodes. Those nodes may have various metrics 
> which are very useful for understanding what the execution looks like and 
> where there is room for optimization.
>  
> {code:java}
> case class RDDScanExec(
>     output: Seq[Attribute],
>     rdd: RDD[InternalRow], <-- this field
>     name: String, {code}
>  
> However, the physical plan and the metrics are invisible from the SQL DAG in 
> the Spark History Server. As it is an "existing RDD", the physical plan may 
> be found from some previous SQL. The metrics are not visible from that 
> previous SQL either. This is because the "definitions" of these metrics are 
> reported along with the SparkListenerSQLExecutionStart event of the "previous 
> SQL" (where the physical plan of the RDDScanExec.rdd is in), but the metric 
> values are reported from the SparkListenerTaskEnd event of the tasks which 
> are attached to the SQL with RDDScanExec.
> !ScanExistingRDD.jpg|width=336,height=296!
>  
> Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
> (the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
> as a "leg" (similar to but not the same as a child) in the DAG, or something 
> else that may show the physical plan and metrics?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (SPARK-47017) Show metrics of the physical plan of RDDScanExec's internal RDD in the history server

2024-02-09 Thread Eric Yang (Jira)
Eric Yang created SPARK-47017:
-

 Summary: Show metrics of the physical plan of RDDScanExec's 
internal RDD in the history server
 Key: SPARK-47017
 URL: https://issues.apache.org/jira/browse/SPARK-47017
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 3.5.0, 3.4.0
Reporter: Eric Yang


The RDDScanExec wraps an internal RDD (as below). In our environment, we find 
that this RDD is usually produced by some very large physical plans which 
contain quite a few physical nodes. Those nodes may have various metrics which 
are very useful for understanding what the execution looks like and where 
there is room for optimization.

 
{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow], <-- this field
    name: String, {code}
 

However, the physical plan and the metrics are invisible from the SQL DAG in 
the Spark History Server. As it is an "existing RDD", the physical plan may be 
found from some previous SQL. The metrics are not visible from that previous 
SQL either. This is because the "definitions" of these metrics are reported 
along with the SparkListenerSQLExecutionStart event of the "previous SQL" 
(where the physical plan of the RDDScanExec.rdd is in), but the metric values 
are reported from the SparkListenerTaskEnd event of the tasks which are 
attached to the SQL with RDDScanExec.

!image-2024-02-09-09-34-33-442.png|width=380,height=345!

 

Should we consider showing the physical plan and metrics of RDDScanExec.rdd 
(the "Scan Existing RDD" node in the DAG above)? For example, it may be shown 
as a "leg" (similar to but not the same as a child) in the DAG, or something 
else that may show the physical plan and metrics?

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23717) Leverage docker support in Hadoop 3

2018-09-28 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100
 ] 

Eric Yang edited comment on SPARK-23717 at 9/28/18 4:33 PM:


It is possible to run standalone Spark in YARN Docker containers without any 
code modifications to Spark.  Here is an example yarnfile that I used to run a 
Mesosphere-generated Docker image, and it ran fine:

{code}
{
  "name": "spark",
  "kerberos_principal" : {
"principal_name" : "spark/_h...@example.com",
"keytab" : "file:///etc/security/keytabs/spark.service.keytab"
  },
  "version": "0.1",
  "components" :
  [
{
  "name": "driver",
  "number_of_containers": 1,
  "artifact": {
"id": "mesosphere/spark:latest",
"type": "DOCKER"
  },
  "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "run_privileged_container": true,
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
  "SPARK_NO_DAEMONIZE":"true",
  "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
},
"properties": {
  "docker.network": "host"
}
  }
},
{
  "name": "executor",
  "number_of_containers": 2,
  "artifact": {
"id": "mesosphere/spark:latest",
"type": "DOCKER"
  },
  "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh 
spark://driver-0.spark.spark.ycluster:7077",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "run_privileged_container": true,
  "dependencies": [ "driver" ],
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
  "SPARK_NO_DAEMONIZE":"true",
  "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
},
"properties": {
  "docker.network": "host"
}
  }
}
  ]
}
{code}

The reason for the 30-second sleep is to ensure RegistryDNS has been refreshed 
and updated to respond to DNS queries.  The sleep could be a lot shorter, e.g. 
3 seconds; I did not spend much time trying to fine-tune the DNS wait time.  A 
further enhancement to pass in a keytab and krb5.conf could enable access to 
secure HDFS; that is left as an exercise for the readers of this JIRA.


was (Author: eyang):
It is possible to run standalone Spark in YARN without any code modification to 
spark.  Here is an example yarnfile that I used to run mesosphere generated 
docker image and it ran fine:

{code}
{
  "name": "spark",
  "kerberos_principal" : {
"principal_name" : "spark/_h...@example.com",
"keytab" : "file:///etc/security/keytabs/spark.service.keytab"
  },
  "version": "0.1",
  "components" :
  [
{
  "name": "driver",
  "number_of_containers": 1,
  "artifact": {
"id": "mesosphere/spark:latest",
"type": "DOCKER"
  },
  "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "run_privileged_container": true,
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
  "SPARK_NO_DAEMONIZE":"true",
  "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
},
"properties": {
  "docker.network": "host"
}
  }
},
{
  "name": "executor",
  "number_of_containers": 2,
  "artifact": {
"id": "mesosphere/spark:latest",
"type": "DOCKER"
  },
  "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh 
spark://driver-0.spark.spark.ycluster:7077",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "run_privileged_container": true,
  "dependencies": [ "driver" ],
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
  "SPARK_NO_DAEMONIZE":"true",
  "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
},
"properties": {
  "docker.network": "host"
}
  }
}
  ]
}
{code}

The reason for 30 seconds sleep is to ensure RegistryDNS has been refreshed and 
updated to respond to DNS queries.  The sleep could be a lot shorter like 3 
seconds.  I did not spend much time to try to fine tune the DNS wait time.  
Further enhancement to pass in keytab and krb5.conf can enable access to secure 
HDFS, that would be exercise for the readers of this JIRA.

> Leverage docker support in Hadoop 3
> ---
>
> Key: SPARK-23717
> URL: https://issues.apache.org/jira/browse/SPARK-23717
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.4.0
>Reporter: Mridul Muralidharan
>

[jira] [Commented] (SPARK-23717) Leverage docker support in Hadoop 3

2018-09-28 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100
 ] 

Eric Yang commented on SPARK-23717:
---

It is possible to run standalone Spark in YARN without any code modifications 
to Spark.  Here is an example yarnfile that I used to run a Mesosphere-generated 
Docker image, and it ran fine:

{code}
{
  "name": "spark",
  "kerberos_principal" : {
"principal_name" : "spark/_h...@example.com",
"keytab" : "file:///etc/security/keytabs/spark.service.keytab"
  },
  "version": "0.1",
  "components" :
  [
{
  "name": "driver",
  "number_of_containers": 1,
  "artifact": {
"id": "mesosphere/spark:latest",
"type": "DOCKER"
  },
  "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "run_privileged_container": true,
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
  "SPARK_NO_DAEMONIZE":"true",
  "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
},
"properties": {
  "docker.network": "host"
}
  }
},
{
  "name": "executor",
  "number_of_containers": 2,
  "artifact": {
"id": "mesosphere/spark:latest",
"type": "DOCKER"
  },
  "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh 
spark://driver-0.spark.spark.ycluster:7077",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "run_privileged_container": true,
  "dependencies": [ "driver" ],
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
  "SPARK_NO_DAEMONIZE":"true",
  "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131"
},
"properties": {
  "docker.network": "host"
}
  }
}
  ]
}
{code}

The reason for the 30-second sleep is to ensure RegistryDNS has been refreshed 
and updated to respond to DNS queries.  The sleep could be a lot shorter, e.g. 
3 seconds; I did not spend much time trying to fine-tune the DNS wait time.  A 
further enhancement to pass in a keytab and krb5.conf could enable access to 
secure HDFS; that is left as an exercise for the readers of this JIRA.

> Leverage docker support in Hadoop 3
> ---
>
> Key: SPARK-23717
> URL: https://issues.apache.org/jira/browse/SPARK-23717
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.4.0
>Reporter: Mridul Muralidharan
>Priority: Major
>
> The introduction of Docker support in Apache Hadoop 3 can be leveraged by 
> Apache Spark for resolving multiple long-standing shortcomings - particularly 
> those related to package isolation.
> It also allows for network isolation, where applicable, allowing for more 
> sophisticated cluster configuration/customization.
> This JIRA will track the various tasks for enhancing Spark to leverage 
> container support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7

2018-09-06 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606052#comment-16606052
 ] 

Eric Yang commented on SPARK-25330:
---

{quote}
user.getRealUser(): ad...@kerberos.mycom.com (auth:KERBEROS)
user.getRealUser().isFromKeytab(): false
user.getRealUser().hasKerberosCredentials(): false
{quote}

If I am reading this correctly, the RealUser must either be from a keytab or 
have Kerberos credentials.  Both cannot be false; otherwise, it would be a 
security breach in Kerberos, since the RealUser was not authorized by the KDC.  
[~daryn] [~jlowe] thoughts?
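
For context, and purely as an illustration of where those values come from, the 
flags quoted above map to the Hadoop UserGroupInformation API roughly as follows 
(the object and variable names here are made up):

{code:scala}
import org.apache.hadoop.security.UserGroupInformation

// Illustrative only: how the quoted values could be inspected on the client.
object CheckRealUser {
  def main(args: Array[String]): Unit = {
    val ugi = UserGroupInformation.getCurrentUser
    val realUser = ugi.getRealUser   // may be null when no proxy user is involved
    if (realUser != null) {
      println(s"user.getRealUser(): $realUser")
      println(s"user.getRealUser().isFromKeytab(): ${realUser.isFromKeytab}")
      println(s"user.getRealUser().hasKerberosCredentials(): ${realUser.hasKerberosCredentials}")
    }
  }
}
{code}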

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
> -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
> spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql{code}
>  
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=user_b, access=EXECUTE, 
> inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7

2018-09-05 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605114#comment-16605114
 ] 

Eric Yang commented on SPARK-25330:
---

[~yumwang] Does Hadoop 2.7.5 work?  It might help us isolate the release that 
introduced the regression and narrow down the number of JIRAs that the Hadoop 
team needs to go through.  Thanks

> Permission issue after upgrade hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
> -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
> spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql{code}
>  
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=user_b, access=EXECUTE, 
> inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org