Hi,
We've recently started seeing a huge increase in
spark.driver.maxResultSize - we are starting to set it at 3GB (and
increase our driver memory a lot to 12GB or so). This is on v1.6.1 with
Mesos scheduler.
All the docs I can see is that this is to do with .collect() being
called on a large RDD (which isn't the case AFAIK - certainly nothing in
the code) and it's rather puzzling me as to what's going on. I thought
that the number of tasks was coming into it (about 14000 tasks in each
of about a dozen stages). Adding a coalesce seemed to help but now we
are hitting the problem again after a few minor code tweaks.
What else could be contributing to this? Thoughts I've had:
- number of tasks
- metrics?
- um, a bit stuck!
The code looks like this:
df=....
df.persist()
val rows = df.count()
// actually we loop over this a few times
val output = df. groupBy("id").agg(
avg($"score").as("avg_score"),
count($"id").as("rows")
).
select(
$"id",
$"avg_score,
$"rows",
).sort($"id")
output.coalesce(1000).write.format("com.databricks.spark.csv").save('/tmp/...')
Cheers for any help/pointers! There are a couple of memory leak tickets
fixed in v1.6.2 that may affect the driver so I may try an upgrade (the
executors are fine).
Adrian
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org