[GitHub] [spark] JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)

2020-01-18 Thread GitBox
JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder 
thread-safety bug in createDataset(Seq)
URL: https://github.com/apache/spark/pull/26076#issuecomment-575970963
 
 
   Hi @dongjoon-hyun,
   
   I spent some time prototyping a refactoring of `ExpressionEncoder` which 
separates the mutable and immutable state (in order to significantly reduce the 
cost of `.copy()`). This _is_ doable but ends up involving a lot of code 
movement and potentially some duplication (since some helper functions are in 
both the constructor and after construction). I think that's definitely the 
right long-term approach but I'll need some more time to figure out a 
minimally-invasive and clean way of making that change.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)

2019-10-10 Thread GitBox
JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder 
thread-safety bug in createDataset(Seq)
URL: https://github.com/apache/spark/pull/26076#issuecomment-540824919
 
 
   > That sounds like a good idea to me. Can we do that first?
   
   I'll prototype this. If I get it working then I'll open a second PR and will 
ping / link it here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)

2019-10-09 Thread GitBox
JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder 
thread-safety bug in createDataset(Seq)
URL: https://github.com/apache/spark/pull/26076#issuecomment-540211354
 
 
   > But if we would like to fix all these problems, all public APIs accepting 
`Encoder` will need the copy.
   
   I think that most existing uses of Encoders are thread-safe because either 
(a) the use occurs inside of a Spark task and task gets its own fresh copy when 
the `Task` is deserialized or (b) the use occurs on the driver but the code 
calls call `resolveAndBind` (which internally performs a `copy`) prior to using 
the Encoder.
   
   Given this, I suspect that this might be the only non-thread-safe Encoder 
usage in Spark (excluding code which is only used in Spark's unit tests). I 
don't think that we need to introduce similar copying in other public APIs.
   
   > I did some research about this and found some noticeable performance 
regression in our internal benchmark.
   
   What do you think about improving the performance / reducing the cost of 
`.copy()` by refactoring the `ExpressionEncoder` class such that (a) all of the 
immutable `vals` become fields of the case class, (b) the current constructor 
becomes a `.apply()` on the companion object and the case class constructor 
becomes `private`, and (c) `resolveAndBind` calls the companion object 
constructor instead of `copy()`? Given this, I think `copy()` could be _really_ 
cheap, effectively giving us a fresh copy of the internal mutable state but 
copying all other immutable attributes without performing any re-resolution, 
analysis, attribute binding, etc.
   
   If we do that, we'd be able to defensively copy at _very_ low cost (e.g. one 
object allocation) and then could copy-by-default and free users from having to 
worry about thread-safety.
   
   I think that's a potentially huge win from a developer productivity 
point-of-view: the cost / toil of having to worry about thread-unsafe code is a 
tax placed on end users and creates a developer education / training burden, so 
I think it's worth attempting to eliminate this entire class of pitfall.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org