[jira] [Commented] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application
[ https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658343#comment-16658343 ]

Felix Cheung commented on SPARK-25634:
--------------------------------------

How about off-heap and netty buffer usage?

> New Metrics in External Shuffle Service to help identify abusing application
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25634
>                 URL: https://issues.apache.org/jira/browse/SPARK-25634
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.4.0
>            Reporter: Ye Zhou
>            Priority: Minor
>
> We run Spark on YARN and deploy the Spark external shuffle service as part
> of the YARN NM aux services. The External Shuffle Service is shared by all
> Spark applications. SPARK-24355 reserves threads for handling
> non-ChunkFetchRequest messages, and SPARK-21501 limits the memory used by
> the Guava cache to avoid OOMs in the shuffle service that could crash the
> NodeManager. Still, some applications may generate a large number of shuffle
> blocks, which can heavily degrade performance on some shuffle servers. When
> this abusive behavior happens, it may further degrade overall performance
> for other applications that happen to use the same shuffle servers. We have
> been seeing issues like this in our cluster, but there is no way for us to
> figure out which application is abusing the shuffle service.
>
> SPARK-18364 exposed shuffle service metrics to the Hadoop Metrics System. It
> would be better if we also had the following metrics, broken down by
> application ID:
> 1. *shuffle server on-heap memory consumption for caching shuffle indexes*
> 2. *breakdown of shuffle index caching memory consumption by local executors*
> We can generate these metrics in
> ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData, which triggers
> the cache load. We can roughly derive the values from the shuffle index file
> size when an entry is put into the cache and when it is evicted.
> 3. *shuffle server load for shuffle block fetch requests*
> 4. *breakdown of shuffle server block fetch request load by remote executors*
> We can generate these metrics in ExternalShuffleBlockHandler-->handleMessage
> when a new OpenBlocks message is received.
>
> This is an open discussion for more metrics that could potentially influence
> overall shuffle service performance.
>
> We can print the per-application metrics to the log, since it is hard to
> define a fixed key with a numerical value for this kind of metric.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
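The per-application accounting proposed in the description (index-cache bytes tracked at cache load/eviction, fetch-request counts tracked on OpenBlocks) could be sketched roughly as below. This is an illustrative sketch only, not Spark's actual implementation: the class `PerAppShuffleMetrics` and its method names are hypothetical, and where the hooks would be called is noted in comments based on the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of per-application shuffle service counters.
// Names are illustrative; only the hook points follow the JIRA description.
public class PerAppShuffleMetrics {
    // appId -> number of OpenBlocks (block fetch) requests served
    private final Map<String, LongAdder> fetchRequests = new ConcurrentHashMap<>();
    // appId -> bytes of shuffle index data currently held in the cache
    private final Map<String, LongAdder> indexCacheBytes = new ConcurrentHashMap<>();

    // Would be called from handleMessage when an OpenBlocks message arrives.
    public void recordOpenBlocks(String appId) {
        fetchRequests.computeIfAbsent(appId, k -> new LongAdder()).increment();
    }

    // Would be called when a shuffle index file is loaded into the cache,
    // using the index file size as a rough estimate of memory consumed.
    public void recordIndexCached(String appId, long fileSizeBytes) {
        indexCacheBytes.computeIfAbsent(appId, k -> new LongAdder()).add(fileSizeBytes);
    }

    // Would be called from the cache's removal listener on eviction.
    public void recordIndexEvicted(String appId, long fileSizeBytes) {
        indexCacheBytes.computeIfAbsent(appId, k -> new LongAdder()).add(-fileSizeBytes);
    }

    public long fetchRequestCount(String appId) {
        LongAdder a = fetchRequests.get(appId);
        return a == null ? 0 : a.sum();
    }

    public long cachedIndexBytes(String appId) {
        LongAdder a = indexCacheBytes.get(appId);
        return a == null ? 0 : a.sum();
    }

    // Render per-application metrics as log lines, since application IDs are
    // dynamic and do not fit a fixed-key metrics registry.
    public String logLines() {
        StringBuilder sb = new StringBuilder();
        fetchRequests.forEach((app, n) ->
            sb.append("app=").append(app)
              .append(" openBlocksRequests=").append(n.sum())
              .append(" indexCacheBytes=").append(cachedIndexBytes(app))
              .append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        PerAppShuffleMetrics m = new PerAppShuffleMetrics();
        m.recordOpenBlocks("app_1");
        m.recordOpenBlocks("app_1");
        m.recordIndexCached("app_1", 4096);
        m.recordIndexEvicted("app_1", 1024);
        System.out.print(m.logLines());
    }
}
```

`LongAdder` keeps the counters cheap under concurrent updates from the shuffle server's handler threads, and emitting the breakdown as log lines sidesteps the fixed-key limitation of the metrics registry that the description mentions.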
[jira] [Commented] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application
[ https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637436#comment-16637436 ]

Ye Zhou commented on SPARK-25634:
---------------------------------

[~felixcheung] [~vanzin] [~tgraves] [~irashid] [~zsxwing] More comments? Thanks