[jira] [Comment Edited] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999832#comment-16999832 ]

huangweiyi edited comment on SPARK-30246 at 12/19/19 8:03 AM:
--
I'm working on this and added a patch for verification last week. So far it looks good, based on monitoring the stream count and the NM's heap memory usage. I will open a PR for this after verification.

was (Author: unclehuang): I'm working on this and added a patch for verification last week. So far it looks good, based on monitoring the stream count and the NM's heap memory usage.

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
> Reporter: huangweiyi
> Priority: Major
>
> In our large, busy YARN cluster, which deploys the Spark external shuffle service as part of the YARN NM aux services, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the heap even though the apps those StreamStates belong to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> Incoming references to StreamState::associatedChannel:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
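The leak pattern described in the report can be sketched in plain Java. This is a hypothetical, simplified stand-in, not Spark's actual OneForOneStreamManager code (all names here are illustrative): a registry maps stream IDs to per-stream state, and each state pins the channel it was registered for. If entries are only removed when the client finishes reading them, a connection that drops or an app that exits mid-stream leaves its state in the map forever. One fix direction, in the spirit of the patch the reporter describes, is a channel-termination hook that sweeps out every stream tied to the closed channel:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a stream registry with per-channel cleanup.
// "Object" stands in for io.netty.channel.Channel to keep this self-contained.
public class StreamRegistrySketch {
    static class StreamState {
        final String appId;
        final Object associatedChannel; // the connection this stream serves

        StreamState(String appId, Object channel) {
            this.appId = appId;
            this.associatedChannel = channel;
        }
    }

    private final AtomicLong nextStreamId = new AtomicLong(0);
    private final Map<Long, StreamState> streams = new ConcurrentHashMap<>();

    // Register a stream for an app on a given channel; returns its id.
    public long registerStream(String appId, Object channel) {
        long id = nextStreamId.getAndIncrement();
        streams.put(id, new StreamState(appId, channel));
        return id;
    }

    // Cleanup hook: when a channel terminates, drop every stream associated
    // with it, so states for finished/killed apps cannot stay on the heap.
    public void connectionTerminated(Object channel) {
        streams.values().removeIf(s -> s.associatedChannel == channel);
    }

    // The "streams size" metric the reporter monitors after patching.
    public int numStreams() {
        return streams.size();
    }

    public static void main(String[] args) {
        StreamRegistrySketch mgr = new StreamRegistrySketch();
        Object ch1 = new Object(), ch2 = new Object();
        mgr.registerStream("app_1", ch1);
        mgr.registerStream("app_1", ch1);
        mgr.registerStream("app_2", ch2);
        mgr.connectionTerminated(ch1); // app_1's connection goes away
        System.out.println(mgr.numStreams()); // prints 1: only ch2's stream remains
    }
}
```

Without the `connectionTerminated` sweep, the two `app_1` entries (and any buffers they reference) would remain in the map indefinitely, which matches the heap-dump picture above where OneForOneStreamManager dominates retained memory.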
[jira] [Commented] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999832#comment-16999832 ]

huangweiyi commented on SPARK-30246:
I'm working on this and added a patch for verification last week. So far it looks good, based on monitoring the stream count and the NM's heap memory usage.
[jira] [Commented] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999830#comment-16999830 ]

huangweiyi commented on SPARK-30246:
Hi [~jfilipiak], I added a figure of the incoming references to StreamState::associatedChannel above.
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Description: (see the issue description quoted above)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Description: (see the issue description quoted above)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Description: (see the issue description quoted above)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Description: (see the issue description quoted above)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Description: (see the issue description quoted above)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: nm_heap_overview.png)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: streamState.png)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: nm_oom.png)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Attachment: streamState.png
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: streamState.png)
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Attachment: streamState.png
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
Attachment: nm_heap_overview.png
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huangweiyi updated SPARK-30246:
---
    Attachment: nm_oom.png
[jira] [Created] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
huangweiyi created SPARK-30246:
--

Summary: Spark on Yarn External Shuffle Service Memory Leak
Key: SPARK-30246
URL: https://issues.apache.org/jira/browse/SPARK-30246
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 2.4.3
Environment: hadoop 2.7.3
spark 2.4.3
jdk 1.8.0_60
Reporter: huangweiyi

In our large, busy YARN cluster, which runs the Spark external shuffle service on each NodeManager (NM), we encountered OOMs in some NMs. After dumping the heap memory, I found some StreamState objects still on the heap, even though the apps those StreamStates belong to had already finished.
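The leak pattern above can be sketched with a simplified, self-contained model (hypothetical names, not Spark's actual OneForOneStreamManager): stream state is tied to the client channel that opened it, and if it is only removed when a stream is fully consumed, state for finished apps lingers on the shuffle service's heap. Releasing all streams when their channel terminates is one way to avoid that.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of a per-channel stream registry (hypothetical; Spark's
// real OneForOneStreamManager is more involved). Each stream keeps a
// reference to the client channel that opened it.
public class StreamRegistry {
    // Stand-in for a Netty channel.
    public static class Channel {}

    public static class StreamState {
        final Channel associatedChannel;
        StreamState(Channel ch) { this.associatedChannel = ch; }
    }

    private final AtomicLong nextStreamId = new AtomicLong(0);
    private final Map<Long, StreamState> streams = new ConcurrentHashMap<>();

    public long registerStream(Channel channel) {
        long id = nextStreamId.getAndIncrement();
        streams.put(id, new StreamState(channel));
        return id;
    }

    // The cleanup step: when a channel terminates (e.g. the app's executor
    // exits), drop every StreamState associated with it instead of waiting
    // for the stream to be fully consumed.
    public void connectionTerminated(Channel channel) {
        streams.entrySet()
               .removeIf(e -> e.getValue().associatedChannel == channel);
    }

    public int numStreams() { return streams.size(); }

    public static void main(String[] args) {
        StreamRegistry registry = new StreamRegistry();
        Channel app1 = new Channel();
        Channel app2 = new Channel();
        registry.registerStream(app1);
        registry.registerStream(app1);
        registry.registerStream(app2);
        registry.connectionTerminated(app1); // app1 finished
        System.out.println(registry.numStreams()); // only app2's stream left
    }
}
```

Without the connectionTerminated step, the map in this sketch grows for the lifetime of the process, which matches the heap-dump picture of retained StreamState objects for finished apps.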
[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero
[ https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940260#comment-16940260 ]

huangweiyi commented on SPARK-29273:

[~angerszhuuu] welcome! I have heard of your nice work; a pity I didn't get to see you.

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
> Reporter: huangweiyi
> Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the
> peakExecutionMemory exposed by TaskMetrics, but I always get a zero value.
[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero
[ https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940242#comment-16940242 ]

huangweiyi commented on SPARK-29273:

Hi [~angerszhuuu], the zero peakExecutionMemory value happens when replaying events. It is an accumulated value that is updated while a task is running, but the SparkListenerTaskEnd event does not include this info when it is logged to the filesystem.
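The replay gap described in this comment can be modeled with a small sketch (hypothetical names, not Spark's actual JsonProtocol): a metric that is nonzero in the live TaskMetrics reads back as zero when the serialized task-end event omits it.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the event-log replay gap (hypothetical names): a metric
// accumulated while the task runs is dropped by the writer, so a replayer
// that defaults missing fields to zero never sees the live value.
public class MetricReplay {

    // Pretend event-log writer: copies only some fields of the live
    // metrics, omitting peakExecutionMemory (mirroring the reported
    // behavior, not Spark's actual serialization code).
    static Map<String, Long> logTaskEnd(Map<String, Long> liveMetrics) {
        Map<String, Long> logged = new HashMap<>();
        logged.put("executorRunTime", liveMetrics.get("executorRunTime"));
        logged.put("executorCpuTime", liveMetrics.get("executorCpuTime"));
        return logged;
    }

    // Pretend replayer: a field missing from the logged event reads as 0.
    static long replayMetric(Map<String, Long> logged, String name) {
        return logged.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> live = new HashMap<>();
        live.put("executorRunTime", 1253L);
        live.put("executorCpuTime", 924518630L);
        live.put("peakExecutionMemory", 67108864L); // nonzero while running

        Map<String, Long> logged = logTaskEnd(live);
        System.out.println("peakExecutionMemory: "
                + replayMetric(logged, "peakExecutionMemory")); // 0
        System.out.println("executorRunTime: "
                + replayMetric(logged, "executorRunTime"));     // 1253
    }
}
```

The fix direction implied by the comment is on the writer side: include the accumulated value in the serialized task-end event so replays can recover it.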
[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero
[ https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939466#comment-16939466 ]

huangweiyi commented on SPARK-29273:

I do the same thing as the SHS (Spark History Server) to replay the Spark event log. When parsing SparkListenerTaskEnd, I print out some metric values. Here is the code snippet:

case taskEnd: SparkListenerTaskEnd =>
  info(s"peakExecutionMemory: ${taskEnd.taskMetrics.peakExecutionMemory}")
  info(s"executorRunTime: ${taskEnd.taskMetrics.executorRunTime}")
  info(s"executorCpuTime: ${taskEnd.taskMetrics.executorCpuTime}")
  ...

And here is the output:

19/09/27 21:31:40 INFO SparkFSProcessor: peakExecutionMemory: 0
19/09/27 21:31:40 INFO SparkFSProcessor: executorRunTime: 1253
19/09/27 21:31:40 INFO SparkFSProcessor: executorCpuTime: 924518630

I have added a PR to this issue; please help review, many thanks!
[jira] [Created] (SPARK-29273) Spark peakExecutionMemory metrics is zero
huangweiyi created SPARK-29273:
--

Summary: Spark peakExecutionMemory metrics is zero
Key: SPARK-29273
URL: https://issues.apache.org/jira/browse/SPARK-29273
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.3
Environment: hadoop 2.7.3
spark 2.4.3
jdk 1.8.0_60
Reporter: huangweiyi

With Spark 2.4.3 in our production environment, I want to get the peakExecutionMemory exposed by TaskMetrics, but I always get a zero value.