Are you using the Kryo serializer? If not, have a look at it; it can save a lot
of memory during shuffles:

https://spark.apache.org/docs/latest/tuning.html
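
Enabling it is a small SparkConf change; a minimal sketch, assuming the
KeyClass/ValueClass from your job (registering them is optional, but it keeps
Kryo from writing full class names into the stream):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: register the classes that cross shuffle boundaries
      .registerKryoClasses(Array(classOf[KeyClass], classOf[ValueClass]))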

I did a similar task and had various issues with the volume of data being
parsed in one go, and Kryo helped a lot. The main difference I can see between
your job and mine is that my input classes were just a string and a byte
array, which I only parsed after they had been read into the RDD. Maybe your
classes are memory-heavy?
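
Roughly, the pattern I used looks like this (a sketch; `parse` stands in for
whatever decoding you do, and `sc` is the SparkContext):

    import org.apache.hadoop.io.{BytesWritable, Text}

    // Read lightweight Writables; do the real parsing inside the RDD.
    val raw = sc.sequenceFile(inputDir, classOf[Text], classOf[BytesWritable])
    val parsed = raw.map { case (k, v) => parse(k.toString, v.copyBytes()) }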


Thanks,
Ewan

-----Original Message-----
From: andrew.row...@thomsonreuters.com 
[mailto:andrew.row...@thomsonreuters.com] 
Sent: 27 August 2015 11:53
To: user@spark.apache.org
Subject: Driver running out of memory - caused by many tasks?

I have a Spark v1.4.1 on YARN job whose first stage has ~149,000 tasks (it’s
reading a few TB of data). The job itself is fairly simple; it’s just getting
a list of distinct values:

    // `spark` here is the SparkContext (this is Spark 1.4.x)
    val days = spark
      .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
      .sample(withReplacement = false, fraction = 0.01) // work on a 1% sample
      .map(row => row._1.getTimestamp.toString("yyyy-MM-dd")) // key -> day string
      .distinct()
      .collect() // expect only a handful of distinct days

The cardinality of ‘day’ is quite small: there are only a handful of distinct
values. However, I’m frequently running into OutOfMemory errors on the driver.
I’ve had it fail with 24GB of RAM, and am currently nudging it upwards to find
where it starts working. The ratio between input and shuffle write in the
distinct stage is about 3TB:7MB. On a smaller dataset, the job runs without
issue on a smaller (4GB) heap. In YARN cluster mode, I get a failure message
similar to:

    Container [pid=36844,containerID=container_e15_1438040390147_4982_01_000001]
    is running beyond physical memory limits. Current usage: 27.6 GB of 27 GB
    physical memory used; 29.5 GB of 56.7 GB virtual memory used. Killing
    container.
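
For reference, that 27 GB cap is the driver heap plus the YARN memory
overhead, so I’ve been launching along these lines (the overhead value here
is illustrative):

    # driver container size = --driver-memory + spark.yarn.driver.memoryOverhead
    spark-submit \
      --master yarn-cluster \
      --driver-memory 24g \
      --conf spark.yarn.driver.memoryOverhead=3072 \
      ...

3 GB of overhead on top of the 24 GB heap matches the 27 GB limit in the
message above.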


Is the driver running out of memory simply because of the number of tasks, or
is there something about the job itself that’s pulling a lot of data into the
driver heap and causing the OOM? If the former, is there any general guidance
on how much memory to give the driver as a function of the number of tasks?

Andrew
