[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
[ https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-19098:
---------------------------------
    Labels: bulk-closed  (was: )

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -------------------------------------------------------------------------
>
>                 Key: SPARK-19098
>                 URL: https://issues.apache.org/jira/browse/SPARK-19098
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 2.0.2, 2.1.0
>         Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>            Reporter: Steven Ruppert
>            Priority: Minor
>              Labels: bulk-closed
>         Attachments: Screen Shot 2017-01-30 at 18.36.43-fullpage.png,
> doubling-season.png
>
>
> I'm seeing a strange memory-leak-but-not-really problem in a pretty vanilla
> ConnectedComponents use, notably one that works fine with identical code on
> spark 2.0.1, but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have
> access to the original logs, so this initial report will be a little vague.
> However, this behavior as described might ring a bell to somebody.
> Roughly:
> {noformat}
> val edges: RDD[Edge[Int]] = _ // from file
> val vertices: RDD[(VertexId, Int)] = _ // from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10)
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a
> strange doubling of shuffle traffic in each round of Pregel (inside
> ConnectedComponents), increasing from the actual data size of ~50 GB, to
> 100 GB, to 200 GB, all the way to around 40 TB before I killed the job. The
> data being shuffled was apparently an RDD of ShippableVertexPartition.
> Oddly enough, only the kryo-serialized shuffled data doubled in size. The
> heap usage of the executors themselves remained stable, or at least did not
> account 1 to 1 for the 40 TB of shuffled data, for I definitely do not have
> 40 TB of RAM. Furthermore, I also have kryo reference tracking turned on
> still, so whatever is leaking somehow gets around that.
> I'll update this ticket once I have more details, unless somebody else with
> the same problem reports back first.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
[ https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-19098:
------------------------------
    Priority: Minor  (was: Critical)

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -------------------------------------------------------------------------
>
>                 Key: SPARK-19098
>                 URL: https://issues.apache.org/jira/browse/SPARK-19098
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 2.1.0
>         Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>            Reporter: Steven Ruppert
>            Priority: Minor
>         Attachments: doubling-season.png
[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
[ https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Ruppert updated SPARK-19098:
-----------------------------------
    Attachment: doubling-season.png

Screenshot of the spark UI for the job, showing the doubling effect.

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -------------------------------------------------------------------------
>
>                 Key: SPARK-19098
>                 URL: https://issues.apache.org/jira/browse/SPARK-19098
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 2.1.0
>         Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>            Reporter: Steven Ruppert
>            Priority: Critical
>         Attachments: doubling-season.png