Hi Augusto,

Right now the default behavior is to cache every intermediate RDD that is consumed more than once by the pipeline. This can be disabled globally with `options.setCacheDisabled(true)` [1], but there is currently no way to tell the runner to cache some RDDs and not others.
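For reference, here is a minimal sketch of how that option can be set programmatically. The pipeline body is just a placeholder; the point is only the `SparkPipelineOptions` wiring:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CacheDisabledExample {
  public static void main(String[] args) {
    // Parse any command-line flags, then view the options as SparkPipelineOptions.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    // Turn off the runner's automatic caching of reused intermediate RDDs.
    options.setCacheDisabled(true);

    Pipeline pipeline = Pipeline.create(options);
    // ... build your transforms here ...
    pipeline.run().waitUntilFinish();
  }
}
```

Since it's a standard pipeline option, it should also be settable from the command line with `--cacheDisabled=true`.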
There has recently been some discussion on Slack (#spark-beam) about implementing such a feature, but there are no concrete plans as of yet.

[1] https://github.com/apache/beam/blob/81faf35c8a42493317eba9fa1e7b06fb42d54662/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L150

Thanks,
Kyle Weaver | Software Engineer | github.com/ibzib | [email protected] | +16502035555

*From:* [email protected] <[email protected]>
*Date:* Tue, May 14, 2019 at 5:01 AM
*To:* <[email protected]>

> Hi,
>
> I guess the title says it all: right now it seems like Beam caches all the
> intermediate RDD results for my pipeline when using the Spark runner, which
> leads to very inefficient memory usage. Is there any way to control this?
>
> Best regards,
> Augusto
