[jira] [Comment Edited] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-19 Thread huangweiyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999832#comment-16999832
 ] 

huangweiyi edited comment on SPARK-30246 at 12/19/19 8:03 AM:
--

I'm working on this and added a patch for verification last week. So far it 
looks good, based on monitoring the streams size and the NM's heap memory 
usage. I will open a PR for this once verification is done.


was (Author: unclehuang):
I'm working on this and added a patch for verification last week. So far it 
looks good, based on monitoring the streams size and the NM's heap memory usage. 
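
For context, the kind of cleanup being verified could look roughly like the
sketch below. This is a minimal Scala illustration, not the actual patch: it
assumes the shuffle service keeps a map from stream id to stream state whose
values record the owning appId, and drops any entry still registered for an
application once that application is reported finished. LeakedStreamState is a
hypothetical stand-in for the real StreamState.

import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the real StreamState; only the field the
// cleanup needs is modeled here.
final case class LeakedStreamState(appId: String)

class AppAwareCleanup(streams: ConcurrentHashMap[Long, LeakedStreamState]) {
  // Invoked when the shuffle service learns that an application finished.
  // In the real service, the buffers held by each removed state would also
  // need to be released.
  def applicationRemoved(appId: String): Unit =
    streams.entrySet().removeIf(e => e.getValue.appId == appId)
}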

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, where the Spark external shuffle service is 
> deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> Incoming references to StreamState::associatedChannel:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!
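
To make the retention path concrete, here is a simplified Scala sketch of the
bookkeeping described above. It is illustrative only, not the actual Spark
code; the names mirror OneForOneStreamManager and StreamState, and the
assumption baked in is the one visible in the dump: entries are removed only
when their associated channel terminates, so a long-lived executor connection
keeps every StreamState registered through it reachable, even after the owning
application has finished.

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Illustrative only: models just the fields seen in the heap dump.
final case class StreamStateSketch(appId: String, associatedChannel: AnyRef)

class StreamManagerSketch {
  private val nextStreamId = new AtomicLong(0)
  private val streams = new ConcurrentHashMap[Long, StreamStateSketch]()

  def registerStream(appId: String, channel: AnyRef): Long = {
    val id = nextStreamId.getAndIncrement()
    // The state (and, transitively, its buffers) is retained until removed below.
    streams.put(id, StreamStateSketch(appId, channel))
    id
  }

  // The only cleanup path in this sketch: states are dropped when their
  // channel closes. Nothing here reacts to an application finishing.
  def connectionTerminated(channel: AnyRef): Unit =
    streams.entrySet().removeIf(e => e.getValue.associatedChannel eq channel)
}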






[jira] [Commented] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-19 Thread huangweiyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999832#comment-16999832
 ] 

huangweiyi commented on SPARK-30246:


I'm working on this and added a patch for verification last week. So far it 
looks good, based on monitoring the streams size and the NM's heap memory usage. 

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, where the Spark external shuffle service is 
> deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> Incoming references to StreamState::associatedChannel:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!






[jira] [Commented] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-19 Thread huangweiyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999830#comment-16999830
 ] 

huangweiyi commented on SPARK-30246:


Hi [~jfilipiak], I added a figure above showing the incoming references to 
StreamState::associatedChannel.

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, where the Spark external shuffle service is 
> deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> Incoming references to StreamState::associatedChannel:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-18 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Description: 
In our large, busy YARN cluster, where the Spark external shuffle service is 
deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!

Incoming references to StreamState::associatedChannel:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!

  was:
In our large, busy YARN cluster, where the Spark external shuffle service is 
deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!


> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, where the Spark external shuffle service is 
> deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> Incoming references to StreamState::associatedChannel:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-18 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Description: 
In our large, busy YARN cluster, where the Spark external shuffle service is 
deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!

  was:
In our large, busy YARN cluster, where the Spark external shuffle service is 
deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!




> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, where the Spark external shuffle service is 
> deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Description: 
In our large, busy YARN cluster, where the Spark external shuffle service is 
deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!



  was:
In our large, busy YARN cluster, which runs the Spark external shuffle service 
on each NodeManager (NM), we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!




> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, where the Spark external shuffle service is 
> deployed as part of the YARN NM aux services, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Description: 
In our large, busy YARN cluster, which runs the Spark external shuffle service 
on each NodeManager (NM), we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!



  was:
In our large, busy YARN cluster, which runs the Spark external shuffle service 
on each NodeManager (NM), we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!




> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Description: 
In our large, busy YARN cluster, which runs the Spark external shuffle service 
on each NodeManager (NM), we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.

Here are some related figures:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!

The heap dump below shows that the memory consumption mainly consists of two 
parts:
*(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
*(2) PoolChunk (1,059,201,712 bytes, 18.44%)*

!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!

Digging into the OneForOneStreamManager, some StreamStates still remain:
!https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!



  was:
In our large, busy YARN cluster, which runs the Spark external shuffle service 
on each NodeManager (NM), we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.




> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 bytes, 77.11%)*
> *(2) PoolChunk (1,059,201,712 bytes, 18.44%)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: nm_heap_overview.png)

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: streamState.png)

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: nm_oom.png)

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: streamState.png

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
> Attachments: nm_heap_overview.png, nm_oom.png, streamState.png
>
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: (was: streamState.png)

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
> Attachments: nm_heap_overview.png, nm_oom.png
>
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: streamState.png

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
> Attachments: nm_heap_overview.png, nm_oom.png, streamState.png
>
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: nm_heap_overview.png

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
> Attachments: nm_heap_overview.png, nm_oom.png, streamState.png
>
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangweiyi updated SPARK-30246:
---
Attachment: nm_oom.png

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
> Attachments: nm_oom.png
>
>
> In our large, busy YARN cluster, which runs the Spark external shuffle service 
> on each NodeManager (NM), we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still on the 
> heap, even though the app each StreamState belongs to had already finished.






[jira] [Created] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2019-12-12 Thread huangweiyi (Jira)
huangweiyi created SPARK-30246:
--

 Summary: Spark on Yarn External Shuffle Service Memory Leak
 Key: SPARK-30246
 URL: https://issues.apache.org/jira/browse/SPARK-30246
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 2.4.3
 Environment: hadoop 2.7.3
spark 2.4.3
jdk 1.8.0_60
Reporter: huangweiyi


In our large, busy YARN cluster, which runs the Spark external shuffle service 
on each NodeManager (NM), we encountered OOMs in some NMs.
After dumping the heap memory, I found some StreamState objects still on the 
heap, even though the app each StreamState belongs to had already finished.








[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-29 Thread huangweiyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940260#comment-16940260
 ] 

huangweiyi commented on SPARK-29273:


[~angerszhuuu]

Welcome! I have heard about your nice work; a pity I didn't get to see you.

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the 
> peakExecutionMemory value that TaskMetrics exposes, but I always get zero.






[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-28 Thread huangweiyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940242#comment-16940242
 ] 

huangweiyi commented on SPARK-29273:


Hi [~angerszhuuu],

The zero peakExecutionMemory value happens when replaying the event log. It is 
an accumulated value that is updated while a task runs, but the 
SparkListenerTaskEnd event does not include this information when it is logged 
to the filesystem.
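
One quick way to confirm this is to inspect the event log directly. Below is a
minimal Scala sketch, assuming json4s (which Spark bundles) and the standard
event-log field names "Event" and "Task Metrics"; the "Peak Execution Memory"
key is an assumption about what a writer would call the field. It reports
whether SparkListenerTaskEnd records in a log file carry such an entry at all:

import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object CheckPeakExecutionMemory {
  def main(args: Array[String]): Unit = {
    implicit val formats: Formats = DefaultFormats
    // Each line of a Spark event log is one JSON-encoded SparkListenerEvent.
    Source.fromFile(args(0)).getLines()
      .map(parse(_))
      .filter(j => (j \ "Event").extractOpt[String].contains("SparkListenerTaskEnd"))
      .foreach { j =>
        val peak = j \ "Task Metrics" \ "Peak Execution Memory"
        // JNothing means the writer never logged the value, which is
        // consistent with the replayed metric reading as zero.
        println(s"Peak Execution Memory in log: $peak")
      }
  }
}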

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the 
> peakExecutionMemory value that TaskMetrics exposes, but I always get zero.






[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread huangweiyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939466#comment-16939466
 ] 

huangweiyi commented on SPARK-29273:


I do the same thing as the SHS (Spark History Server) to replay the Spark 
event log. When parsing a SparkListenerTaskEnd event, I print out some metrics 
values; here is the code snippet:

case taskEnd: SparkListenerTaskEnd =>
  info(s"peakExecutionMemory: ${taskEnd.taskMetrics.peakExecutionMemory}")
  info(s"executorRunTime: ${taskEnd.taskMetrics.executorRunTime}")
  info(s"executorCpuTime: ${taskEnd.taskMetrics.executorCpuTime}")
  ...

Here is the output:

19/09/27 21:31:40 INFO SparkFSProcessor: peakExecutionMemory: 0
19/09/27 21:31:40 INFO SparkFSProcessor: executorRunTime: 1253
19/09/27 21:31:40 INFO SparkFSProcessor: executorCpuTime: 924518630

I have added a PR to this issue; please help review it. Many thanks!

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the 
> peakExecutionMemory value that TaskMetrics exposes, but I always get zero.






[jira] [Created] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread huangweiyi (Jira)
huangweiyi created SPARK-29273:
--

 Summary: Spark peakExecutionMemory metrics is zero
 Key: SPARK-29273
 URL: https://issues.apache.org/jira/browse/SPARK-29273
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
 Environment: hadoop 2.7.3
spark 2.4.3
jdk 1.8.0_60
Reporter: huangweiyi


With Spark 2.4.3 in our production environment, I want to get the 
peakExecutionMemory value that TaskMetrics exposes, but I always get zero.


