Hello! If I understand correctly, in Spark the common pattern for multiple outputs is to collect them all into a single _persisted_ RDD internally, then filter it into the separate RDDs (one per output) on demand.
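To illustrate why persisting matters here, a minimal sketch in plain Python (not actual Spark or Beam code; the names and the counting are illustrative assumptions) that simulates lazy recomputation: without materializing the "all" result once, each downstream filter re-runs the upstream work.

```python
# Sketch only: simulates an unpersisted vs. persisted "all" RDD.
# `expensive` stands in for upstream computation; `calls` counts
# how many times it runs.
calls = 0

def expensive(x):
    global calls
    calls += 1
    return x * 2

def make_all():
    # Analogous to an unpersisted RDD: recomputed on every "action".
    return [expensive(x) for x in range(4)]

# Without persisting: each filter re-triggers the upstream computation.
evens = [v for v in make_all() if v % 4 == 0]
odds = [v for v in make_all() if v % 4 != 0]
calls_without_persist = calls  # upstream ran twice (8 calls)

# With "persisting": materialize once, then filter the cached result.
calls = 0
all_cached = make_all()
evens = [v for v in all_cached if v % 4 == 0]
odds = [v for v in all_cached if v % 4 != 0]
calls_with_persist = calls  # upstream ran once (4 calls)
```

The per-tag filters play the role of the per-TupleTag PCollection extraction; persisting the combined result keeps the upstream stage from executing once per output.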
Persisting is usually the right thing to do. Otherwise, Spark could risk fusing the "all" RDD representing the PCollectionTuple with each filter that extracts a PCollection (one per TupleTag), recalculating "all" many times... or worse, continually recalculating the upstream RDDs if there's no fusion break or upstream persist.

It looks like the cacheDisabled flag (https://issues.apache.org/jira/browse/BEAM-6053) is only considered for RDDs under PCollections that are reused in the job DAG. That does sound like a bug to me, since the description of the flag implies all-or-nothing.

I hope this helps,
Ryan

On Wed, Feb 26, 2020 at 11:17 AM Ajit Dongre <[email protected]> wrote:
> Hello,
>
> I am running a simple Beam pipeline with the Spark runner.
>
> I found in Beam's code that a particular RDD is cached if the corresponding
> DoFn uses a PCollectionTuple, as mentioned in TransformTranslator.java
> (line number 413).
> I want to know what the need is for this kind of caching.
>
> Also, the SparkRunner option --cacheDisabled is not honoured at this code
> level. Any specific reason?
>
> Regards,
> Ajit Dongre
