JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder
thread-safety bug in createDataset(Seq)
URL: https://github.com/apache/spark/pull/26076#issuecomment-540211354
> But if we would like to fix all these problems, all public APIs accepting
`Encoder` will need the copy.
I think that most existing uses of Encoders are thread-safe because either
(a) the use occurs inside of a Spark task and task gets its own fresh copy when
the `Task` is deserialized or (b) the use occurs on the driver but the code
calls call `resolveAndBind` (which internally performs a `copy`) prior to using
the Encoder.
Given this, I suspect that this might be the only non-thread-safe Encoder
usage in Spark (excluding code which is only used in Spark's unit tests). I
don't think that we need to introduce similar copying in other public APIs.
> I did some research about this and found some noticeable performance
regression in our internal benchmark.
What do you think about improving the performance / reducing the cost of
`.copy()` by refactoring the `ExpressionEncoder` class such that (a) all of the
immutable `vals` become fields of the case class, (b) the current constructor
becomes a `.apply()` on the companion object and the case class constructor
becomes `private`, and (c) `resolveAndBind` calls the companion object
constructor instead of `copy()`? Given this, I think `copy()` could be _really_
cheap, effectively giving us a fresh copy of the internal mutable state but
copying all other immutable attributes without performing any re-resolution,
analysis, attribute binding, etc.
If we do that, we'd be able to defensively copy at _very_ low cost (e.g. one
object allocation) and then could copy-by-default and free users from having to
worry about thread-safety.
I think that's a potentially huge win from a developer productivity
point-of-view: the cost / toil of having to worry about thread-unsafe code is a
tax placed on end users and creates a developer education / training burden, so
I think it's worth attempting to eliminate this entire class of pitfall.
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
With regards,
Apache Git Services
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org