[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19098:
---------------------------------
Labels: bulk-closed  (was: )

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -------------------------------------------------------------------------
>
> Key: SPARK-19098
> URL: https://issues.apache.org/jira/browse/SPARK-19098
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.2, 2.1.0
> Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>Reporter: Steven Ruppert
>Priority: Minor
>  Labels: bulk-closed
> Attachments: Screen Shot 2017-01-30 at 18.36.43-fullpage.png, 
> doubling-season.png
>
>
> I'm seeing a strange memory-leak-like (but not quite a leak) problem in a fairly
> vanilla use of ConnectedComponents; notably, identical code works fine on
> Spark 2.0.1 but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have
> access to the original logs, so this initial report will be a little vague.
> However, the behavior described here might ring a bell for somebody.
> Roughly: 
> {noformat}
> type ComponentId = VertexId // the component id is the smallest VertexId in the component
> val edges: RDD[Edge[Int]] = ???          // loaded from file
> val vertices: RDD[(VertexId, Int)] = ??? // loaded from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10) // cap at 10 Pregel iterations
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a
> strange doubling of shuffle traffic in each round of Pregel (inside
> ConnectedComponents): it grows from the actual data size of ~50 GB to 100 GB,
> then 200 GB, all the way to around 40 TB before I killed the job. The data
> being shuffled was apparently an RDD of ShippableVertexPartition.
> Oddly enough, only the Kryo-serialized shuffled data doubled in size. The
> heap usage of the executors themselves remained stable, or at least did not
> account one-to-one for the 40 TB of shuffled data, since I certainly do not
> have 40 TB of RAM. Furthermore, I still have Kryo reference tracking turned
> on, so whatever is leaking somehow gets around it.
> I'll update this ticket once I have more details, unless somebody else with
> the same problem reports back first.
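
A minimal, self-contained sketch of the setup described above, for anyone trying to reproduce it: the Kryo serializer, GraphX Kryo registration, and HDFS checkpoint directory mirror the environment listed in this ticket, while the HDFS paths and the tab-separated input format are purely illustrative assumptions, not details from the report.

{noformat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, GraphXUtils, VertexId}
import org.apache.spark.graphx.lib.ConnectedComponents
import org.apache.spark.rdd.RDD

object ConnectedComponentsRepro {
  def main(args: Array[String]): Unit = {
    // Kryo serialization with GraphX's classes registered, as in the reported
    // environment; Kryo reference tracking is left at its default (enabled).
    val conf = new SparkConf()
      .setAppName("cc-repro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    GraphXUtils.registerKryoClasses(conf)
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/cc-checkpoints") // hypothetical HDFS path

    // Hypothetical tab-separated inputs: "srcId<TAB>dstId" and "vertexId<TAB>attr".
    val edges: RDD[Edge[Int]] = sc.textFile("hdfs:///data/edges.tsv").map { line =>
      val Array(src, dst) = line.split("\t")
      Edge(src.toLong, dst.toLong, 1)
    }
    val vertices: RDD[(VertexId, Int)] = sc.textFile("hdfs:///data/vertices.tsv").map { line =>
      val Array(id, attr) = line.split("\t")
      (id.toLong, attr.toInt)
    }

    // Build the graph and run connected components, capped at 10 Pregel iterations.
    val graph: Graph[Int, Int] = Graph(vertices, edges)
    val components: RDD[(VertexId, VertexId)] =
      ConnectedComponents.run(graph, 10).vertices

    components.saveAsTextFile("hdfs:///data/components") // hypothetical output path
    sc.stop()
  }
}
{noformat}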



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19098:
------------------------------
Priority: Minor  (was: Critical)

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -------------------------------------------------------------------------
>
> Key: SPARK-19098
> URL: https://issues.apache.org/jira/browse/SPARK-19098
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.1.0
> Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>Reporter: Steven Ruppert
>Priority: Minor
> Attachments: doubling-season.png
>
>
> I'm seeing a strange memory-leak-like (but not quite a leak) problem in a fairly
> vanilla use of ConnectedComponents; notably, identical code works fine on
> Spark 2.0.1 but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have
> access to the original logs, so this initial report will be a little vague.
> However, the behavior described here might ring a bell for somebody.
> Roughly: 
> {noformat}
> type ComponentId = VertexId // the component id is the smallest VertexId in the component
> val edges: RDD[Edge[Int]] = ???          // loaded from file
> val vertices: RDD[(VertexId, Int)] = ??? // loaded from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10) // cap at 10 Pregel iterations
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a
> strange doubling of shuffle traffic in each round of Pregel (inside
> ConnectedComponents): it grows from the actual data size of ~50 GB to 100 GB,
> then 200 GB, all the way to around 40 TB before I killed the job. The data
> being shuffled was apparently an RDD of ShippableVertexPartition.
> Oddly enough, only the Kryo-serialized shuffled data doubled in size. The
> heap usage of the executors themselves remained stable, or at least did not
> account one-to-one for the 40 TB of shuffled data, since I certainly do not
> have 40 TB of RAM. Furthermore, I still have Kryo reference tracking turned
> on, so whatever is leaking somehow gets around it.
> I'll update this ticket once I have more details, unless somebody else with
> the same problem reports back first.






[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations

2017-01-05 Thread Steven Ruppert (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Ruppert updated SPARK-19098:
-----------------------------------
Attachment: doubling-season.png

Screenshot of the spark UI for the job, showing the doubling effect.

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -------------------------------------------------------------------------
>
> Key: SPARK-19098
> URL: https://issues.apache.org/jira/browse/SPARK-19098
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.1.0
> Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>Reporter: Steven Ruppert
>Priority: Critical
> Attachments: doubling-season.png
>
>
> I'm seeing a strange memory-leak-like (but not quite a leak) problem in a fairly
> vanilla use of ConnectedComponents; notably, identical code works fine on
> Spark 2.0.1 but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have
> access to the original logs, so this initial report will be a little vague.
> However, the behavior described here might ring a bell for somebody.
> Roughly: 
> {noformat}
> type ComponentId = VertexId // the component id is the smallest VertexId in the component
> val edges: RDD[Edge[Int]] = ???          // loaded from file
> val vertices: RDD[(VertexId, Int)] = ??? // loaded from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10) // cap at 10 Pregel iterations
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a
> strange doubling of shuffle traffic in each round of Pregel (inside
> ConnectedComponents): it grows from the actual data size of ~50 GB to 100 GB,
> then 200 GB, all the way to around 40 TB before I killed the job. The data
> being shuffled was apparently an RDD of ShippableVertexPartition.
> Oddly enough, only the Kryo-serialized shuffled data doubled in size. The
> heap usage of the executors themselves remained stable, or at least did not
> account one-to-one for the 40 TB of shuffled data, since I certainly do not
> have 40 TB of RAM. Furthermore, I still have Kryo reference tracking turned
> on, so whatever is leaking somehow gets around it.
> I'll update this ticket once I have more details, unless somebody else with
> the same problem reports back first.
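
For later readers: this ticket never establishes what causes the doubling, but the per-iteration growth happens inside GraphX's Pregel loop. As a hedged note, later Spark releases (2.2.0 onward, if memory serves) expose spark.graphx.pregel.checkpointInterval, which makes Pregel checkpoint the graph (and so truncate its lineage) every N iterations; whether that would change this particular symptom is an assumption, not something verified here. A minimal sketch of enabling it, with an arbitrary example interval and a hypothetical checkpoint path:

{noformat}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: assumes a Spark build where spark.graphx.pregel.checkpointInterval exists.
// It bounds Pregel lineage growth; it is not a confirmed fix for this report.
val conf = new SparkConf()
  .setAppName("cc-with-pregel-checkpointing")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.graphx.pregel.checkpointInterval", "5") // checkpoint every 5 Pregel rounds (example value)
val sc = new SparkContext(conf)
sc.setCheckpointDir("hdfs:///tmp/cc-checkpoints") // required when checkpointing is enabled; hypothetical path
{noformat}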


