Dear Spark users,

Is it possible to take the output of a transformation (RDD/Dataframe) and
feed it to two independent transformations without recalculating the first
transformation and without caching the whole dataset?

Consider the case of a very large dataset (1+TB) which suffered several
transformations and now we want to save it but also calculate some
statistics per group.
So the ideal processing scheme would be: for each partition, do task A,
then do task B.
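To make the desired pattern concrete, here is a plain-Python stand-in for the one-pass, per-partition scheme described above (illustrative names only, not Spark API): each partition is traversed exactly once, and both task A (saving the rows) and task B (computing per-group statistics) happen during that single traversal.

```python
# Plain-Python sketch of one-pass, per-partition processing.
# All names are illustrative; this is not Spark API.

def process_partition(rows):
    saved = []            # task A: collect rows, standing in for writing them out
    count, total = 0, 0   # task B: running statistics computed in the same pass
    for row in rows:
        saved.append(row)
        count += 1
        total += row
    return saved, (count, total)

# Two toy "partitions", each handled independently in a single pass.
partitions = [[1, 2, 3], [4, 5]]
results = [process_partition(p) for p in partitions]
```

The point of the sketch is that each row is read once and consumed by both tasks, which is exactly the behavior that seems hard to express in Spark without caching.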

I don't see a way of instructing Spark to proceed that way without
caching to disk, which seems unnecessarily heavy. And if we don't cache,
Spark recalculates every partition all the way from the beginning. In
either case, huge file reads happen.

Any ideas on how to avoid it?

Thanks
Fernando
