Dear Community,

I am new to Spark, and I am confused by the comment in the following method:

def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very
  // specific use case: using union to union many files or partitions.
  CombineUnions(Union(logicalPlan, other.logicalPlan)).mapChildren(AnalysisBarrier)
}

Here is the corresponding PR comment:
https://github.com/apache/spark/pull/10577#discussion_r48820132


Another option would just be to do this at construction time, that way we
can avoid paying the cost in the analyzer. *This would still limit the
cases we could cache (i.e. we'd miss cached data unioned with other data),
but that doesn't seem like a huge deal.*


Could anyone please kindly explain to me what *This breaks caching* means?
It would be awesome if an example were given.
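
For context, here is my own rough guess at the scenario, as a minimal
sketch in Scala (the setup, object name, and the expected behaviour are
my assumptions; please correct me if I have it wrong):

import org.apache.spark.sql.SparkSession

object UnionCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-caching-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val a = Seq(1, 2, 3).toDS()
    val b = Seq(4, 5, 6).toDS()
    val c = Seq(7, 8, 9).toDS()

    // Cache the union of a and b and materialize it.
    val ab = a.union(b)
    ab.cache()
    ab.count()

    // My guess: because union() applies CombineUnions eagerly, the plan of
    // a.union(b).union(c) is flattened to a single Union(a, b, c), which no
    // longer contains the cached Union(a, b) subtree, so the cached data
    // would not be reused here. Is that what "breaks caching" refers to?
    val abc = a.union(b).union(c)
    abc.explain() // I would expect no InMemoryRelation for (a union b) here

    spark.stop()
  }
}

If I read the comment correctly, the flattened plan can no longer match the
cached Union(a, b) fragment, but I am not sure this is the right picture.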

Best regards,
Yi Huang
