Hey All,
I'm trying to run PageRank with GraphFrames on a large graph (about 90
billion edges and 1.4 TB total disk size) and I'm running into some issues.
The code is very simple: it just loads the DataFrames from S3 and passes
them to the GraphFrames PageRank function. But when I run it, the cluster is
very slow and shows high garbage collection times. So I followed these
two articles on tuning GC:
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
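
For reference, this is roughly what the job looks like (the paths, the
maxIter value and the write step are placeholders, not my exact code):

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("pagerank").getOrCreate()

    # vertices need an "id" column, edges need "src" and "dst" columns
    vertices = spark.read.parquet("s3://my-bucket/nodes/")
    edges = spark.read.parquet("s3://my-bucket/edges/")

    g = GraphFrame(vertices, edges)
    result = g.pageRank(resetProbability=0.15, maxIter=10)
    result.vertices.write.parquet("s3://my-bucket/pagerank-result/")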

After implementing these changes I also resized my executors to 5 cores and
36g of memory. Everything worked perfectly for a couple of test runs on
either 75 or 151 executors. Sadly I did not finish any of the test run
calculations, because I wanted to deploy the app on a larger cluster, but
looking back it might have just worked on those test clusters. Anyway, I
tried to deploy to a cluster with 300+ executors, and now the cluster was
stalling hard again. Even going back to the known good configuration of the
test clusters didn't help, because the cluster was stalling there as well.
(I'm running the cluster on EMR, so going back to an old config is just
cloning the cluster that once worked.)
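
To be concrete about the executor sizing: I actually set it through the
EMR/spark-submit configuration, but in code it would correspond to
something like this:

    from pyspark.sql import SparkSession

    # executor sizing after following the GC tuning articles
    spark = (
        SparkSession.builder
        .appName("pagerank")
        .config("spark.executor.cores", "5")
        .config("spark.executor.memory", "36g")
        .getOrCreate()
    )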

The cluster stalling comes in two "types":
Either the cluster maps the files in the two S3 Parquet directories (nodes
and edges) as 2 tasks and then halts for about 10 minutes before the
executors report high GC times and shuffle writes. The executor dashboard
looks like this <https://drive.google.com/open?id=1fytT7lgiWSESUOhN-gM8tS084roQEbFy>
for about 10 minutes.
Or the cluster runs right away, but very slowly and with high GC times; then
the executor dashboard looks like this:
<https://drive.google.com/file/d/1nS0_mYWREdqUKAA3m9RSTh7-PRu53NXW/view?usp=sharing>

Now, as I said, the crazy thing is that it used to work with the same
configuration and the same code, and I have no idea why it doesn't now.

When I look at Ganglia, the cluster traffic looks like this for the failed
attempts:
<https://drive.google.com/file/d/1vbkAcWjLKOvlTRDLnrzfO-M7i2RX4EZG/view?usp=sharing>
(each spike is one restart of the app)
When it was working, the traffic would peak briefly at >2 GB/s and then level
off at around 1 GB/s. When the app is failing, it spikes briefly to about
2 GB/s and then goes back down to 0 again, as in the screenshot.

So did I just hit a bug in the GraphFrames library, or am I missing
something fundamental?
I tried:
resizing the executors
resizing the driver
resizing the cluster
changing the config parameters described in the articles (roughly the
settings sketched below)
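
By "config parameters" I mean settings along these lines; the concrete
flags and values here are only illustrative of what the two articles
suggest, not necessarily exactly what I ran with:

    from pyspark.sql import SparkSession

    # GC-related settings in the spirit of the Databricks/AWS articles
    spark = (
        SparkSession.builder
        .appName("pagerank")
        .config("spark.executor.extraJavaOptions",
                "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")
        .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
        .config("spark.memory.fraction", "0.6")
        .getOrCreate()
    )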

Nothing seems to work; my current working theory is that it just works on
Tuesdays ...

I use Spark 2.4.3 (pyspark)
on EMR 2.5.6
with GraphFrames 0.7.0-spark2.4-s_2.11
