I was attempting to use the GraphX triangle count method on a 2B-edge graph
(the Friendster dataset from SNAP) and running into out-of-memory issues. I
have access to a 60-node cluster with 90 GB of memory and 30 vcores per node.

I am using 1,000 partitions with the RandomVertexCut partition strategy.
Here's my submit script:

spark-submit \
  --executor-cores 5 \
  --num-executors 100 \
  --executor-memory 32g \
  --driver-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=8000 \
  --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
  trianglecount_2.10-1.0.jar
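
For context, the driver is essentially the following (a minimal sketch; the
input path is a placeholder, and everything else is the stock GraphX API):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    object TriangleCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TriangleCount"))

        // Load the Friendster edge list into 1000 edge partitions.
        // canonicalOrientation = true orients every edge srcId < dstId,
        // which triangleCount requires.
        val graph = GraphLoader
          .edgeListFile(sc, "hdfs:///path/to/friendster.txt", // placeholder
            canonicalOrientation = true, numEdgePartitions = 1000)
          .partitionBy(PartitionStrategy.RandomVertexCut)

        // triangleCount returns a per-vertex count; each triangle is
        // counted once at each of its three vertices, hence the / 3.
        val total = graph.triangleCount().vertices.map(_._2.toLong).sum() / 3
        println(s"Total triangles: $total")
      }
    }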

There was one particular stage that shuffled 3.7 TB:

Active Stages (1)

Stage Id  Description                               Submitted            Duration  Tasks: Succeeded/Total  Input    Shuffle Read  Shuffle Write
11        mapPartitions at VertexRDDImpl.scala:218  2015/11/12 01:33:06  7.3 min   316/344                 22.6 GB  57.0 GB       3.7 TB
The subsequent stage then fails while reading that shuffle, around the
half-terabyte mark, with java.lang.OutOfMemoryError: Java heap space:


Active Stages (1)

Stage Id  Description                           Submitted            Duration  Tasks: Succeeded/Total  Input    Shuffle Read  Shuffle Write
12        mapPartitions at GraphImpl.scala:235  2015/11/12 01:41:25  5.2 min   0/1000                  26.3 GB  533.8 GB      -


Compared to the cluster used in the Spark benchmarking paper
(http://arxiv.org/pdf/1402.2394v1.pdf) on the Twitter dataset (2.5B edges),
the resources I am giving this job seem reasonable. Can anyone point out any
optimizations or other tweaks I need to make to get this to work?
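
One tweak I am considering, though I have not verified that it helps here, is
switching to Kryo serialization and registering the GraphX classes, which
should shrink the shuffled data (a sketch, assuming the standard GraphXUtils
helper):

    import org.apache.spark.SparkConf
    import org.apache.spark.graphx.GraphXUtils

    // Hypothetical tuning, not yet verified against this job: Kryo usually
    // serializes GraphX's vertex/edge messages much more compactly than
    // Java serialization, reducing both shuffle size and heap pressure.
    val conf = new SparkConf()
      .setAppName("TriangleCount")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    GraphXUtils.registerKryoClasses(conf)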

Thanks!
Vinod


