Hello, I believe, that currently non-deterministic expressions are handled in two conflicting approaches in the catalyst optimizer.
The first approach is the one I have seen in the recent pull request reviews - the optimizer should never change the number of times a non-deterministic expression is executed. A good example of this is /`Canonicalize.scala`/: * In addition and multiplication we allow reordering non-deterministic expressions, because both sides will be evaluated anyways. * In boolean OR and AND we *do not* allow reordering non-deterministic expressions, because the right side might not be evaluated. Then there is another approach, where we allow reordering non-deterministic expressions even in boolean OR and AND. A good example of this is the /`PushPredicateThroughJoin`/ rule where we use the /`condition.partition(_.deterministic)`/ pattern. Later the partitioned expressions can be concatenated back, but this effectively changes the order of execution and can make some non-deterministic expressions be not evaluated on all the rows they would have been. Initially I was sure, that the second approach is wrong and was about to make a pull request to fix this. But then I found that this has not been an accidental mistake, but it is done so on purpose: https://github.com/apache/spark/pull/20069 <https://github.com/apache/spark/pull/20069> . I'm sure that both of these approaches have good arguments for them. In my eyes: * The first one allows users be more sure on how their stateful expressions are evaluated - optimizer does not change the output. * The second one allows catalyst to do better optimization. But, by mixing both of them we get the worst of the both worlds - users can't be sure about how the expressions are evaluated and we don't have the "most optimal" queries. What is the community's stance on this issue? Regards, Tanel -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org