Hello!

If I understand correctly, in the Spark runner the common pattern for
multiple outputs is to collect them all into a single _persisted_ RDD
internally, then filter it into the separate RDDs (one per output) on
demand.

Persisting is usually the right thing to do.  Otherwise, Spark would
risk fusing the "all" RDD representing the PCollectionTuple with each
filter used to get a PCollection (one per TupleTag), recalculating
"all" many times... or worse, recalculating the upstream RDDs
repeatedly if there's no fusion break or upstream persist.

It looks like the cacheDisabled flag
(https://issues.apache.org/jira/browse/BEAM-6053) is only considered
for RDDs under PCollections that are reused in the job DAG.  That does
sound like a bug to me, since the description of the flag implies
all-or-nothing.

I hope this helps, Ryan

On Wed, Feb 26, 2020 at 11:17 AM Ajit Dongre
<[email protected]> wrote:
>
> Hello,
>
> I am running a simple Beam pipeline with the Spark runner.
>
> I found in Beam's code that a particular RDD is cached if the
> corresponding DoFn uses a PCollectionTuple, as seen in
> TransformTranslator.java (line number 413).
>
> What is the need for this kind of caching?
>
> Also, the SparkRunner option --cacheDisabled is not honoured at this
> code level. Any specific reason?
>
> Regards,
>
> Ajit Dongre