What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Tom Seddon
I have a job that is running into intermittent errors with [SparkDriver]
java.lang.OutOfMemoryError: Java heap space. Before I was getting this
error, I was getting errors saying the result size exceeded
spark.driver.maxResultSize.
This does not make any sense to me, as there are no actions in my job that
send data to the driver - just a pull of data from S3, a map and a
reduceByKey, then a conversion to a DataFrame and a saveAsTable action that
writes the results back to S3.
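
For reference, here is a minimal sketch of the shape of the job (the S3
path, the key function and the table name below are placeholders, not the
real code):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("reduceByKeyJob"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Pull from S3, map to (key, value) pairs, reduce by key, then
  // convert to a DataFrame and write the results back out as a table.
  val counts = sc.textFile("s3n://my-bucket/input/")
    .map(line => (line.split(",")(0), 1L))
    .reduceByKey(_ + _)

  counts.toDF("key", "count").write.saveAsTable("results")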

I've found a few references to reduceByKey and spark.driver.maxResultSize
having some importance, but cannot fathom how this setting could be related.

Would greatly appreciate any advice.

Thanks in advance,

Tom


Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Zhan Zhang
I think you are fetching too many results to the driver. Typically it is not
recommended to collect much data to the driver. But if you have to, you can
increase the driver memory when submitting the job.
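
For example, something like this when submitting (the memory values, class
name and jar are only illustrative):

  spark-submit \
    --driver-memory 8g \
    --conf spark.driver.maxResultSize=4g \
    --class com.example.MyJob \
    myjob.jar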

Thanks.

Zhan Zhang


Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Eugen Cepoi
Do you have a large number of tasks? This can happen if you have a large
number of tasks and a small driver, or if you use accumulators of list-like
data structures. Every task sends a serialized result (status, metrics,
accumulator updates) back to the driver, and Spark checks the total size of
these results against spark.driver.maxResultSize, so a job with many tasks
can hit the limit even without an explicit collect.
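
As a contrived sketch of the accumulator case (the badRecords accumulator
and the input path are made up): a collection-valued accumulator ships every
task's updates back inside the task results, so it both counts toward
spark.driver.maxResultSize and ends up merged in driver heap.

  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("accumulator-sketch"))

  // Anti-pattern: an accumulator that grows with the data. Each task's
  // updates are serialized into its task result and merged on the driver.
  val badRecords = sc.accumulableCollection(ArrayBuffer[String]())

  sc.textFile("s3n://my-bucket/input/").foreach { line =>
    if (line.split(",").length < 2) badRecords += line
  }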
