Hi Augusto,

Right now the default behavior is to cache every intermediate RDD that is consumed more than once by the pipeline. This can be disabled globally with `options.setCacheDisabled(true)` [1], but there is currently no way to tell the runner to cache some RDDs and not others.
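For reference, here is a minimal sketch of how that option can be set programmatically. The pipeline body is just a placeholder; the point is only the `SparkPipelineOptions` wiring:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CacheDisabledExample {
  public static void main(String[] args) {
    // Parse any command-line flags, then view the options as SparkPipelineOptions.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    // Turn off the runner's automatic caching of reused intermediate RDDs.
    options.setCacheDisabled(true);

    Pipeline pipeline = Pipeline.create(options);
    // ... build your transforms here ...
    pipeline.run().waitUntilFinish();
  }
}
```

Since it's a standard pipeline option, it should also be settable from the command line with `--cacheDisabled=true`.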
There has recently been some discussion on Slack (#spark-beam) about implementing such a feature, but there are no concrete plans as of yet.

[1] https://github.com/apache/beam/blob/81faf35c8a42493317eba9fa1e7b06fb42d54662/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L150

Thanks,
Kyle Weaver | Software Engineer | github.com/ibzib | [email protected] | +16502035555

*From:* [email protected] <[email protected]>
*Date:* Tue, May 14, 2019 at 5:01 AM
*To:* <[email protected]>

> Hi,
>
> I guess the title says it all: right now it seems like Beam caches all the
> intermediate RDD results for my pipeline when using the Spark runner, which
> leads to very inefficient memory usage. Is there any way to control this?
>
> Best regards,
> Augusto
