I am attempting to use the GraphX triangle count method on a 2B-edge graph (the Friendster dataset from SNAP). I have access to a 60-node cluster with 90 GB of memory and 30 vcores per node, and I am running into memory issues.
I am using 1000 partitions with the RandomVertexCut partition strategy. Here's my submit script:

    spark-submit --executor-cores 5 --num-executors 100 \
      --executor-memory 32g --driver-memory 6g \
      --conf spark.yarn.executor.memoryOverhead=8000 \
      --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
      trianglecount_2.10-1.0.jar

There was one particular stage where it shuffled 3.7 TB:

Active Stages (1)
  Stage Id: 11
  Description: mapPartitions at VertexRDDImpl.scala:218
  Submitted: 2015/11/12 01:33:06
  Duration: 7.3 min
  Tasks (Succeeded/Total): 316/344
  Input / Output / Shuffle Read / Shuffle Write (as shown in the UI, blank columns omitted): 22.6 GB, 57.0 GB, 3.7 TB

In the subsequent stage, it fails while reading the shuffle data, around the half-terabyte mark, with a java.lang.OutOfMemoryError: Java heap space:

Active Stages (1)
  Stage Id: 12
  Description: mapPartitions at GraphImpl.scala:235
  Submitted: 2015/11/12 01:41:25
  Duration: 5.2 min
  Tasks (Succeeded/Total): 0/1000
  Input / Output / Shuffle Read / Shuffle Write (as shown in the UI, blank columns omitted): 26.3 GB, 533.8 GB

Compared to the cluster used for benchmarking on the Twitter dataset (2.5B edges) in http://arxiv.org/pdf/1402.2394v1.pdf, the resources I am providing for the job seem reasonable. Can anyone point out any optimizations or other tweaks I need to make to get this to work?

Thanks!
Vinod
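P.S. As a rough sanity check on the numbers above, here is the arithmetic behind the "seems reasonable" claim. It is only a back-of-the-envelope sketch: it assumes the UI sizes are binary units (1 TB = 1024 GB) and that the 3.7 TB of shuffle data is split evenly across the 1000 tasks of the next stage, which real graph data may well not do. The variable names are just for illustration.

```python
# Back-of-the-envelope resource check using the figures quoted in this post.
# Assumptions (not stated in the post): sizes are binary units
# (1 TB = 1024 GB, 1 GB = 1024 MB) and shuffle data is spread evenly over tasks.

MB_PER_GB = 1024
GB_PER_TB = 1024

# Cluster as described: 60 nodes, 90 GB memory and 30 vcores per node.
cluster_mem_gb = 60 * 90
cluster_vcores = 60 * 30

# Request taken from the spark-submit line.
num_executors = 100
executor_cores = 5
executor_heap_gb = 32
overhead_mb = 8000  # spark.yarn.executor.memoryOverhead

# Each YARN container needs the heap plus the off-heap overhead.
container_gb = executor_heap_gb + overhead_mb / MB_PER_GB
requested_mem_gb = num_executors * container_gb
requested_cores = num_executors * executor_cores

# Stage 11 shuffled 3.7 TB; stage 12 reads it with 1000 tasks.
shuffle_gb = 3.7 * GB_PER_TB
num_tasks = 1000
shuffle_per_task_gb = shuffle_gb / num_tasks

# Heap available per concurrently running task on one executor.
heap_per_core_gb = executor_heap_gb / executor_cores

print(f"requested memory: {requested_mem_gb:.2f} GB of {cluster_mem_gb} GB")
print(f"requested cores:  {requested_cores} of {cluster_vcores}")
print(f"shuffle per task: {shuffle_per_task_gb:.2f} GB (even split assumed)")
print(f"heap per core:    {heap_per_core_gb:.1f} GB")
```

Under those assumptions, the job asks for about 3,981 GB of the cluster's 5,400 GB and 500 of its 1,800 vcores, and each task in stage 12 would read roughly 3.8 GB of shuffle data against about 6.4 GB of heap per concurrently running task. Any skew in the partitioning would push the largest tasks above that average.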