> Persisting is usually the right thing to do. +1, cacheDisabled should only be used if you're certain that *in aggregate* recomputation is faster than writing to and reading from the cache. Keep in mind that cacheDisabled applies to the whole pipeline, meaning you're out of luck if you want to recompute only certain PCollections. There was discussion about maybe annotating PCollections to achieve that, but I don't think anything ever came of it.
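To make that "in aggregate" tradeoff concrete, here is a minimal back-of-the-envelope sketch. The function name and all cost numbers are hypothetical, purely illustrative: caching pays off only when one cache write plus one read per downstream consumer is cheaper than recomputing the collection for every consumer.

```python
def cache_wins(write_cost, read_cost, recompute_cost, n_consumers):
    # Caching: pay the write once, then one read per consumer.
    # No caching: pay the full recompute once per consumer.
    return write_cost + n_consumers * read_cost < n_consumers * recompute_cost

# Expensive write, cheap reads: caching wins once enough consumers share it.
assert cache_wins(write_cost=10, read_cost=1, recompute_cost=5, n_consumers=3)

# With a single consumer, recomputation is cheaper than the cache round-trip.
assert not cache_wins(write_cost=10, read_cost=1, recompute_cost=5, n_consumers=1)
```

Because cacheDisabled is pipeline-wide, this comparison effectively has to hold for every cached collection at once, not just the one you care about.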
That being said, I agree that cacheDisabled being ignored there sounds like a bug. I filed https://jira.apache.org/jira/browse/BEAM-9387 to track the issue.

Thanks,
Kyle

On Wed, Feb 26, 2020 at 7:53 AM Ryan Skraba <[email protected]> wrote:
> Hello!
>
> If I understand correctly, in Spark the common pattern for multiple
> outputs is to collect them all into a single _persisted_ RDD
> internally, then filter into the separate RDDs (one per output) on
> demand.
>
> Persisting is usually the right thing to do. Otherwise, Spark could
> risk fusing the "all" RDD representing the PCollectionTuple with
> each filter used to get the PCollections (one per TupleTag), and
> recalculating "all" many times... or worse, recalculating the upstream
> RDDs continually if there's no fusion break or upstream persist.
>
> It looks like the cacheDisabled flag
> (https://issues.apache.org/jira/browse/BEAM-6053) is only considered
> for RDDs under PCollections that are reused in the job DAG. That does
> sound like a bug to me, since the description of the flag implies
> all-or-nothing.
>
> I hope this helps, Ryan
>
> On Wed, Feb 26, 2020 at 11:17 AM Ajit Dongre
> <[email protected]> wrote:
> >
> > Hello,
> >
> > I am running a simple Beam pipeline with the Spark runner.
> >
> > I found in Beam's code that a particular RDD is cached if the
> > corresponding DoFn uses a PCollectionTuple, as seen in
> > TransformTranslator.java (line number 413).
> >
> > I want to know: what is the need for this kind of caching?
> >
> > Also, the SparkRunner option --cacheDisabled is not honoured at this
> > code level. Any specific reason?
> >
> > Regards,
> > Ajit Dongre
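The recomputation risk Ryan describes can be simulated outside Spark. This is not Beam or Spark code, just a hedged sketch in plain Python: `compute_all` stands in for the expensive "all" RDD holding every tagged output, and each call to `output` plays the role of one per-TupleTag filter. Without a persist, every output recomputes the source; memoizing it (the analogue of `rdd.persist()`) computes it once.

```python
from functools import lru_cache

calls = {"count": 0}

def compute_all():
    # Stand-in for the expensive "all" RDD backing the PCollectionTuple.
    calls["count"] += 1
    return [("even", x) if x % 2 == 0 else ("odd", x) for x in range(10)]

def output(tag, source):
    # One filter per TupleTag, pulling from the shared source on demand.
    return [x for t, x in source() if t == tag]

# Without persisting, each tagged output recomputes "all" from scratch.
evens = output("even", compute_all)
odds = output("odd", compute_all)
assert calls["count"] == 2

# "Persisting" (here, memoizing) computes "all" once; both filters read the cache.
calls["count"] = 0
cached_all = lru_cache(maxsize=None)(compute_all)
evens = output("even", cached_all)
odds = output("odd", cached_all)
assert calls["count"] == 1
```

With many tags, or a long unpersisted upstream chain, the uncached call count grows with the number of outputs, which is exactly why the Spark runner caches the RDD behind a PCollectionTuple.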
