[jira] [Commented] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application

2018-10-21 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658343#comment-16658343
 ] 

Felix Cheung commented on SPARK-25634:
--

How about off-heap and Netty buffer usage?
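
As a rough illustration of the off-heap part of that question, the JVM already reports its NIO buffer pools through the standard java.lang.management API. The sketch below only reads those pools; the class name BufferPoolUsage is hypothetical and not part of Spark. Depending on how it is configured, Netty may allocate direct memory in ways these pools do not report, so Netty's own allocator metrics would probably be needed as well.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not an existing Spark class.
final class BufferPoolUsage {
  /** Returns pool name -> bytes in use, as reported by the JVM buffer pool MXBeans. */
  static Map<String, Long> snapshot() {
    Map<String, Long> usedBytesByPool = new HashMap<>();
    for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
      // On common JVMs the pools include "direct" and "mapped".
      usedBytesByPool.put(pool.getName(), pool.getMemoryUsed());
    }
    return usedBytesByPool;
  }
}
```

A snapshot like this could be published as gauges next to the existing shuffle service metrics.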

> New Metrics in External Shuffle Service to help identify abusing application
> 
>
> Key: SPARK-25634
> URL: https://issues.apache.org/jira/browse/SPARK-25634
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.4.0
> Reporter: Ye Zhou
> Priority: Minor
>
> We run Spark on YARN and deploy the Spark external shuffle service as a YARN
> NodeManager (NM) auxiliary service. The external shuffle service is shared by
> all Spark applications. SPARK-24355 reserves threads for handling
> non-ChunkFetchRequest messages. SPARK-21501 limits the memory used by the
> Guava cache to avoid an OOM in the shuffle service, which could crash the
> NodeManager. Even so, an application may generate a large number of shuffle
> blocks, which can heavily degrade performance on some shuffle servers. When
> this abusive behavior occurs, it can also degrade performance for other
> applications that happen to use the same shuffle servers. We have been
> seeing issues like this in our cluster, but we have no way to determine
> which application is abusing the shuffle service.
> SPARK-18364 exposed shuffle service metrics to the Hadoop Metrics System. It
> would be better to have the following metrics, along with the same metrics
> broken down by application ID:
> 1. *shuffle server on-heap memory consumption for caching shuffle indexes*
> 2. *breakdown of shuffle index caching memory consumption by local
> executors*
> We can generate these metrics in
> ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData, which triggers
> the cache load. We can derive rough values from the size of the shuffle
> index file when an entry is put into the cache and when it is evicted from
> the cache.
> 3. *shuffle server load for shuffle block fetch requests*
> 4. *breakdown of shuffle server block fetch requests load by remote executors*
> We can generate metrics in ExternalShuffleBlockHandler-->handleMessage when a 
> new OpenBlocks message is received.
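
The accounting behind the four metrics above can be sketched as a pair of counter maps. This is a minimal sketch under stated assumptions, not the shuffle service's actual code: ShuffleUsageTracker and its method names are hypothetical, and the callers (the cache load and eviction paths, and the OpenBlocks handling) are assumed to know the application ID, the executor ID, and the index file size or block count.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper for illustration; not an existing Spark class.
final class ShuffleUsageTracker {
  // "appId/execId" -> bytes of shuffle index data currently cached
  private final Map<String, LongAdder> indexBytes = new ConcurrentHashMap<>();
  // "appId/execId" -> number of blocks requested through OpenBlocks
  private final Map<String, LongAdder> blockRequests = new ConcurrentHashMap<>();

  private static String key(String appId, String execId) {
    return appId + "/" + execId;
  }

  private static void add(Map<String, LongAdder> counters, String key, long delta) {
    counters.computeIfAbsent(key, k -> new LongAdder()).add(delta);
  }

  private static long read(Map<String, LongAdder> counters, String key) {
    LongAdder counter = counters.get(key);
    return counter == null ? 0L : counter.sum();
  }

  /** Assumed to be called when an index file of the given size is loaded into the cache. */
  void onIndexCached(String appId, String execId, long indexFileBytes) {
    add(indexBytes, key(appId, execId), indexFileBytes);
  }

  /** Assumed to be called when that cache entry is evicted. */
  void onIndexEvicted(String appId, String execId, long indexFileBytes) {
    add(indexBytes, key(appId, execId), -indexFileBytes);
  }

  /** Assumed to be called once per OpenBlocks message with the number of block IDs in it. */
  void onOpenBlocks(String appId, String execId, int numBlocks) {
    add(blockRequests, key(appId, execId), numBlocks);
  }

  long cachedIndexBytes(String appId, String execId) {
    return read(indexBytes, key(appId, execId));
  }

  long requestedBlocks(String appId, String execId) {
    return read(blockRequests, key(appId, execId));
  }
}
```

LongAdder is used here because many handler threads would update the same counters concurrently; summing the per-executor entries of one application gives the per-application totals.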
> We are open to discussing other metrics that could influence overall
> shuffle service performance.
> We can print the metrics that are broken down by application ID to the log,
> since it is hard to define a fixed key and a numerical value for this kind
> of metric.
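
Printing the per-application breakdown to the log could look roughly like the following. This is only a sketch: AppMetricsLogLine, the metric name, and the line layout are hypothetical, and the input map is assumed to hold one value per application ID. Sorting by value puts the heaviest application first, which is the entry an operator would look for.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not an existing Spark class.
final class AppMetricsLogLine {
  /** Renders one log line, with the largest value first and ties ordered by application ID. */
  static String format(String metricName, Map<String, Long> valueByAppId) {
    return valueByAppId.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Long>comparingByKey()))
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining(", ", metricName + " by application: ", ""));
  }
}
```

The resulting string can be passed to whatever logger the shuffle service already uses, for example on a fixed reporting interval.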



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application

2018-10-03 Thread Ye Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637436#comment-16637436
 ] 

Ye Zhou commented on SPARK-25634:
-

[~felixcheung]  [~vanzin]  [~tgraves]  [~irashid]  [~zsxwing] More comments? 
Thanks
