Hi,
We've inherited a rather long-winded Spark application with many stages. 
When we run it on our Spark cluster, things start off well: the workers are 
busy and plenty of progress is made. However, about 30 minutes into 
processing, CPU usage on the workers drops drastically. At the same time, 
the driver maxes out exactly one core (even though we've given it more than 
one), and its RAM usage creeps steadily upward. While this is happening, no 
logs come out of the driver. Everything seems to stop, then suddenly resumes, 
and the workers pick up again. The driver's RAM usage doesn't drop; it just 
flatlines. A few minutes later the same thing happens again - the world 
seems to stop. This time, though, the driver soon crashes with an 
out-of-memory error.

What could be causing this sort of behaviour on the driver? We don't have any 
collect() or similar calls in the code; we're reading from Azure blobs, 
processing, and writing back to Azure blobs. Where should we start in trying 
to get to the bottom of this? We're running Spark 2.4.1 on a standalone 
cluster.
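
For reference, the shape of the job is roughly as follows (paths, container 
names, and column names here are placeholders, heavily simplified from the 
real code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("blob-pipeline")
      .getOrCreate()

    // Read input from Azure Blob Storage via the wasbs:// connector
    val input = spark.read
      .parquet("wasbs://container@account.blob.core.windows.net/input/")

    // Many chained transformation stages in the real job; simplified here
    val result = input
      .filter("someColumn IS NOT NULL")
      .groupBy("keyColumn")
      .count()

    // Write back to Azure Blob Storage; no collect() or take() anywhere
    result.write
      .mode("overwrite")
      .parquet("wasbs://container@account.blob.core.windows.net/output/")

    spark.stop()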

Thanks,
Ashic.
