Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-13 Thread Romain Manni-Bucau
if you have a javaagent you can otherwise you can't but beam can proxy all
instances it sees which would be enough while the transforms* don't create
their own classloaders without reusing the TCCL. On the deserialization
side it is easy since it is beam land and it "just "needs to find back or
create the right classloader and when the last instance using this
classloader disappear it destroys it.

Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book



Le mer. 13 juin 2018 à 20:32, Lukasz Cwik  a écrit :

> I'm assuming that you have control over the application environment. Would
> it be possible to replace the ObjectInputStream that the JVM provides with
> your own version that uses the thread context class loader and manage the
> classloader per thread depending on what "application" owns that thread?
> (You would need to add a bunch of logic to correctly associate each
> created thread/thread factory with the correct application but most
> application containers already need to do this).
>
> On Wed, Jun 13, 2018 at 11:12 AM Romain Manni-Bucau 
> wrote:
>
>> (answered inline)
>>
>>
>> Le mer. 13 juin 2018 à 18:42, Lukasz Cwik  a écrit :
>>
>>> Thanks for the example Romain.
>>>
>>> I took a look through it and was wondering whether it is only the root
>>> objects in the deserialization tree that need to implement
>>> SerializableService?
>>> Do lots of things need to implement SerializableService typically?
>>>
>>
>> yes, on the one entering the (de)serialization process must handle that
>>
>>
>>> What do you do with types that you don't control (for example do you
>>> create wrapper types)?
>>>
>>
>> Like beam classes? ;)
>>
>> You can instrument their bytecode like in
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/main/java/org/talend/sdk/component/runtime/beam/transformer/BeamIOTransformer.java#L208
>> (sorry i dont use bytebuddy but directly asm). This is quite easy to do as
>> soon as you have a classloader for these classes or - if you reuse the jvm
>> classloader - a javaagent or equivalent.
>>
>> If you don't want/can't change the bytecode then you manipulate a proxy
>> instead of the real instance (a wrapper):
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/asm/ProxyGenerator.java#L118
>> .
>>
>>
>>>
>>> On Wed, Jun 6, 2018 at 9:56 PM Romain Manni-Bucau 
>>> wrote:
>>>
 Note sure the example is atomic enough but in
 https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/finder/StandaloneContainerFinder.java#L40
 the "instance()" is a singleton used by all the runtime of the framework.

 Deserialization happens in
 https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/serialization/SerializableService.java#L26
 and serialization is about creating this object in a write replace. Then
 the runtime is switching its classloader (runner for beam) as in
 https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/base/LifecycleImpl.java#L60
 asap and resets it once done to not break its environment for reused jvms
 case.

 If we take the case of an IO, the io would lazily creates its defined
 classloader from its spec and use some reference counting logic to destroy
 it when needed in its teardown for instance. The io then does the
 classloader switch in its callbacks (setup/teardown/process/bundle hooks
 etc).


 Le mer. 6 juin 2018 23:33, Lukasz Cwik  a écrit :

> Romain, can you point to an example of a global singleton registry
> that does this right for class loading (it may allow people to work 
> towards
> such an effort)?
>
> On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> It is actually very localised in runner code where beam should reset
>> the classloader when the deserialization happens and then the runner owns
>> the classloader all the way in evaluators.
>>
>> If IO change the classloader they must indeed handle it too and patch
>> the deserialization too.
>>
>> Here again (we mentionned it multiple times in other threads) beam
>> misses a global singleton registry where you can register a "service" to
>> look it up based of a serialization configuration and a lifecycle 
>> allowing
>> to close the classloader in all instances without hacks.
>>
>>
>> 

Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-13 Thread Lukasz Cwik
I'm assuming that you have control over the application environment. Would
it be possible to replace the ObjectInputStream that the JVM provides with
your own version that uses the thread context class loader and manage the
classloader per thread depending on what "application" owns that thread?
(You would need to add a bunch of logic to correctly associate each created
thread/thread factory with the correct application but most application
containers already need to do this).

On Wed, Jun 13, 2018 at 11:12 AM Romain Manni-Bucau 
wrote:

> (answered inline)
>
>
> Le mer. 13 juin 2018 à 18:42, Lukasz Cwik  a écrit :
>
>> Thanks for the example Romain.
>>
>> I took a look through it and was wondering whether it is only the root
>> objects in the deserialization tree that need to implement
>> SerializableService?
>> Do lots of things need to implement SerializableService typically?
>>
>
> yes, on the one entering the (de)serialization process must handle that
>
>
>> What do you do with types that you don't control (for example do you
>> create wrapper types)?
>>
>
> Like beam classes? ;)
>
> You can instrument their bytecode like in
> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/main/java/org/talend/sdk/component/runtime/beam/transformer/BeamIOTransformer.java#L208
> (sorry i dont use bytebuddy but directly asm). This is quite easy to do as
> soon as you have a classloader for these classes or - if you reuse the jvm
> classloader - a javaagent or equivalent.
>
> If you don't want/can't change the bytecode then you manipulate a proxy
> instead of the real instance (a wrapper):
> https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/asm/ProxyGenerator.java#L118
> .
>
>
>>
>> On Wed, Jun 6, 2018 at 9:56 PM Romain Manni-Bucau 
>> wrote:
>>
>>> Note sure the example is atomic enough but in
>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/finder/StandaloneContainerFinder.java#L40
>>> the "instance()" is a singleton used by all the runtime of the framework.
>>>
>>> Deserialization happens in
>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/serialization/SerializableService.java#L26
>>> and serialization is about creating this object in a write replace. Then
>>> the runtime is switching its classloader (runner for beam) as in
>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/base/LifecycleImpl.java#L60
>>> asap and resets it once done to not break its environment for reused jvms
>>> case.
>>>
>>> If we take the case of an IO, the io would lazily creates its defined
>>> classloader from its spec and use some reference counting logic to destroy
>>> it when needed in its teardown for instance. The io then does the
>>> classloader switch in its callbacks (setup/teardown/process/bundle hooks
>>> etc).
>>>
>>>
>>> Le mer. 6 juin 2018 23:33, Lukasz Cwik  a écrit :
>>>
 Romain, can you point to an example of a global singleton registry that
 does this right for class loading (it may allow people to work towards such
 an effort)?

 On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

> It is actually very localised in runner code where beam should reset
> the classloader when the deserialization happens and then the runner owns
> the classloader all the way in evaluators.
>
> If IO change the classloader they must indeed handle it too and patch
> the deserialization too.
>
> Here again (we mentionned it multiple times in other threads) beam
> misses a global singleton registry where you can register a "service" to
> look it up based of a serialization configuration and a lifecycle allowing
> to close the classloader in all instances without hacks.
>
>
> Le mar. 5 juin 2018 23:50, Kenneth Knowles  a écrit :
>
>> Perhaps we can also adopt a practice of making our own APIs
>> explicitly pass a Classloader when appropriate so we only have to set 
>> this
>> when we are entering code that does not have good hygiene. It might
>> actually be nice to have a lightweight static analysis to forbid bad
>> methods in our code.
>>
>> Kenn
>>
>> On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:
>>
>>> I totally agree, but there are so many Java APIs (including ours)
>>> that messed this up so everyone lives with the same hack.
>>>
>>> On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
>>> wrote:
>>>
 It seems like a terribly fragile way to pass arguments but my
 tests pass when I wrap the JDBC path into Beam pipeline execution with 
 that
 pattern.


Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-13 Thread Romain Manni-Bucau
(answered inline)


Le mer. 13 juin 2018 à 18:42, Lukasz Cwik  a écrit :

> Thanks for the example Romain.
>
> I took a look through it and was wondering whether it is only the root
> objects in the deserialization tree that need to implement
> SerializableService?
> Do lots of things need to implement SerializableService typically?
>

yes, on the one entering the (de)serialization process must handle that


> What do you do with types that you don't control (for example do you
> create wrapper types)?
>

Like beam classes? ;)

You can instrument their bytecode like in
https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/main/java/org/talend/sdk/component/runtime/beam/transformer/BeamIOTransformer.java#L208
(sorry i dont use bytebuddy but directly asm). This is quite easy to do as
soon as you have a classloader for these classes or - if you reuse the jvm
classloader - a javaagent or equivalent.

If you don't want/can't change the bytecode then you manipulate a proxy
instead of the real instance (a wrapper):
https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/asm/ProxyGenerator.java#L118
.


>
> On Wed, Jun 6, 2018 at 9:56 PM Romain Manni-Bucau 
> wrote:
>
>> Note sure the example is atomic enough but in
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/finder/StandaloneContainerFinder.java#L40
>> the "instance()" is a singleton used by all the runtime of the framework.
>>
>> Deserialization happens in
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/serialization/SerializableService.java#L26
>> and serialization is about creating this object in a write replace. Then
>> the runtime is switching its classloader (runner for beam) as in
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/base/LifecycleImpl.java#L60
>> asap and resets it once done to not break its environment for reused jvms
>> case.
>>
>> If we take the case of an IO, the io would lazily creates its defined
>> classloader from its spec and use some reference counting logic to destroy
>> it when needed in its teardown for instance. The io then does the
>> classloader switch in its callbacks (setup/teardown/process/bundle hooks
>> etc).
>>
>>
>> Le mer. 6 juin 2018 23:33, Lukasz Cwik  a écrit :
>>
>>> Romain, can you point to an example of a global singleton registry that
>>> does this right for class loading (it may allow people to work towards such
>>> an effort)?
>>>
>>> On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 It is actually very localised in runner code where beam should reset
 the classloader when the deserialization happens and then the runner owns
 the classloader all the way in evaluators.

 If IO change the classloader they must indeed handle it too and patch
 the deserialization too.

 Here again (we mentionned it multiple times in other threads) beam
 misses a global singleton registry where you can register a "service" to
 look it up based of a serialization configuration and a lifecycle allowing
 to close the classloader in all instances without hacks.


 Le mar. 5 juin 2018 23:50, Kenneth Knowles  a écrit :

> Perhaps we can also adopt a practice of making our own APIs explicitly
> pass a Classloader when appropriate so we only have to set this when we 
> are
> entering code that does not have good hygiene. It might actually be nice 
> to
> have a lightweight static analysis to forbid bad methods in our code.
>
> Kenn
>
> On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:
>
>> I totally agree, but there are so many Java APIs (including ours)
>> that messed this up so everyone lives with the same hack.
>>
>> On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
>> wrote:
>>
>>> It seems like a terribly fragile way to pass arguments but my tests
>>> pass when I wrap the JDBC path into Beam pipeline execution with that
>>> pattern.
>>>
>>> Thanks!
>>>
>>> Andrew
>>>
>>> On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:
>>>
 It is a common mistake for APIs to not include a way to specify
 which class loader to use when doing something like deserializing an
 instance of a class via the ObjectInputStream. This common issue also
 affects Apache Beam (SerializableCoder, PipelineOptionsFactory, ...) 
 and
 the way that typical Java APIs have gotten around this is to use to the
 thread context class loader (TCCL) as the way to plumb this additional
 attribute through. So Apache Beam is meant to in all places 

Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-13 Thread Lukasz Cwik
Thanks for the example Romain.

I took a look through it and was wondering whether it is only the root
objects in the deserialization tree that need to implement
SerializableService?
Do lots of things need to implement SerializableService typically?
What do you do with types that you don't control (for example do you create
wrapper types)?

On Wed, Jun 6, 2018 at 9:56 PM Romain Manni-Bucau 
wrote:

> Note sure the example is atomic enough but in
> https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/finder/StandaloneContainerFinder.java#L40
> the "instance()" is a singleton used by all the runtime of the framework.
>
> Deserialization happens in
> https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/serialization/SerializableService.java#L26
> and serialization is about creating this object in a write replace. Then
> the runtime is switching its classloader (runner for beam) as in
> https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/base/LifecycleImpl.java#L60
> asap and resets it once done to not break its environment for reused jvms
> case.
>
> If we take the case of an IO, the io would lazily creates its defined
> classloader from its spec and use some reference counting logic to destroy
> it when needed in its teardown for instance. The io then does the
> classloader switch in its callbacks (setup/teardown/process/bundle hooks
> etc).
>
>
> Le mer. 6 juin 2018 23:33, Lukasz Cwik  a écrit :
>
>> Romain, can you point to an example of a global singleton registry that
>> does this right for class loading (it may allow people to work towards such
>> an effort)?
>>
>> On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau 
>> wrote:
>>
>>> It is actually very localised in runner code where beam should reset the
>>> classloader when the deserialization happens and then the runner owns the
>>> classloader all the way in evaluators.
>>>
>>> If IO change the classloader they must indeed handle it too and patch
>>> the deserialization too.
>>>
>>> Here again (we mentionned it multiple times in other threads) beam
>>> misses a global singleton registry where you can register a "service" to
>>> look it up based of a serialization configuration and a lifecycle allowing
>>> to close the classloader in all instances without hacks.
>>>
>>>
>>> Le mar. 5 juin 2018 23:50, Kenneth Knowles  a écrit :
>>>
 Perhaps we can also adopt a practice of making our own APIs explicitly
 pass a Classloader when appropriate so we only have to set this when we are
 entering code that does not have good hygiene. It might actually be nice to
 have a lightweight static analysis to forbid bad methods in our code.

 Kenn

 On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:

> I totally agree, but there are so many Java APIs (including ours) that
> messed this up so everyone lives with the same hack.
>
> On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
> wrote:
>
>> It seems like a terribly fragile way to pass arguments but my tests
>> pass when I wrap the JDBC path into Beam pipeline execution with that
>> pattern.
>>
>> Thanks!
>>
>> Andrew
>>
>> On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:
>>
>>> It is a common mistake for APIs to not include a way to specify
>>> which class loader to use when doing something like deserializing an
>>> instance of a class via the ObjectInputStream. This common issue also
>>> affects Apache Beam (SerializableCoder, PipelineOptionsFactory, ...) and
>>> the way that typical Java APIs have gotten around this is to use to the
>>> thread context class loader (TCCL) as the way to plumb this additional
>>> attribute through. So Apache Beam is meant to in all places honor the 
>>> TCCL
>>> if it has been set as most Java libraries (not all) do the same hack.
>>>
>>> In most environments the TCCL is not set and we are working with a
>>> single class loader. It turns out that in more complicated environments
>>> (like when loading a JDBC driver, or JNDI, or an application server, 
>>> ...)
>>> this usually doesn't work without each caller knowing what class loading
>>> context they should be in. A common work around for most scenarios is to
>>> always set the TCCL to the current classes class loader like so before
>>> invoking any APIs that do class loading so you don't propagate the TCCL 
>>> of
>>> the caller along since they may have set it for some other reason:
>>>
>>> ClassLoader originalClassLoader = 
>>> Thread.currentThread().getContextClassLoader();try {
>>> 
>>> Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
>>> // call some API that uses 

Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-06 Thread Romain Manni-Bucau
Note sure the example is atomic enough but in
https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/finder/StandaloneContainerFinder.java#L40
the "instance()" is a singleton used by all the runtime of the framework.

Deserialization happens in
https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/serialization/SerializableService.java#L26
and serialization is about creating this object in a write replace. Then
the runtime is switching its classloader (runner for beam) as in
https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/base/LifecycleImpl.java#L60
asap and resets it once done to not break its environment for reused jvms
case.

If we take the case of an IO, the io would lazily creates its defined
classloader from its spec and use some reference counting logic to destroy
it when needed in its teardown for instance. The io then does the
classloader switch in its callbacks (setup/teardown/process/bundle hooks
etc).


Le mer. 6 juin 2018 23:33, Lukasz Cwik  a écrit :

> Romain, can you point to an example of a global singleton registry that
> does this right for class loading (it may allow people to work towards such
> an effort)?
>
> On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau 
> wrote:
>
>> It is actually very localised in runner code where beam should reset the
>> classloader when the deserialization happens and then the runner owns the
>> classloader all the way in evaluators.
>>
>> If IO change the classloader they must indeed handle it too and patch the
>> deserialization too.
>>
>> Here again (we mentionned it multiple times in other threads) beam misses
>> a global singleton registry where you can register a "service" to look it
>> up based of a serialization configuration and a lifecycle allowing to close
>> the classloader in all instances without hacks.
>>
>>
>> Le mar. 5 juin 2018 23:50, Kenneth Knowles  a écrit :
>>
>>> Perhaps we can also adopt a practice of making our own APIs explicitly
>>> pass a Classloader when appropriate so we only have to set this when we are
>>> entering code that does not have good hygiene. It might actually be nice to
>>> have a lightweight static analysis to forbid bad methods in our code.
>>>
>>> Kenn
>>>
>>> On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:
>>>
 I totally agree, but there are so many Java APIs (including ours) that
 messed this up so everyone lives with the same hack.

 On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
 wrote:

> It seems like a terribly fragile way to pass arguments but my tests
> pass when I wrap the JDBC path into Beam pipeline execution with that
> pattern.
>
> Thanks!
>
> Andrew
>
> On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:
>
>> It is a common mistake for APIs to not include a way to specify which
>> class loader to use when doing something like deserializing an instance 
>> of
>> a class via the ObjectInputStream. This common issue also affects Apache
>> Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
>> typical Java APIs have gotten around this is to use to the thread context
>> class loader (TCCL) as the way to plumb this additional attribute 
>> through.
>> So Apache Beam is meant to in all places honor the TCCL if it has been 
>> set
>> as most Java libraries (not all) do the same hack.
>>
>> In most environments the TCCL is not set and we are working with a
>> single class loader. It turns out that in more complicated environments
>> (like when loading a JDBC driver, or JNDI, or an application server, ...)
>> this usually doesn't work without each caller knowing what class loading
>> context they should be in. A common work around for most scenarios is to
>> always set the TCCL to the current classes class loader like so before
>> invoking any APIs that do class loading so you don't propagate the TCCL 
>> of
>> the caller along since they may have set it for some other reason:
>>
>> ClassLoader originalClassLoader = 
>> Thread.currentThread().getContextClassLoader();try {
>> 
>> Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
>> // call some API that uses reflection without taking ClassLoader 
>> param} finally {
>> Thread.currentThread().setContextClassLoader(originalClassLoader);}
>>
>>
>>
>> On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud 
>> wrote:
>>
>>> I'm having class loading issues that go away when I revert the
>>> changes in our use of Class.forName added in
>>> https://github.com/apache/beam/pull/4674. The problem I'm having is
>>> that the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) 

Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-06 Thread Lukasz Cwik
Romain, can you point to an example of a global singleton registry that
does this right for class loading (it may allow people to work towards such
an effort)?

On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau 
wrote:

> It is actually very localised in runner code where beam should reset the
> classloader when the deserialization happens and then the runner owns the
> classloader all the way in evaluators.
>
> If IO change the classloader they must indeed handle it too and patch the
> deserialization too.
>
> Here again (we mentionned it multiple times in other threads) beam misses
> a global singleton registry where you can register a "service" to look it
> up based of a serialization configuration and a lifecycle allowing to close
> the classloader in all instances without hacks.
>
>
> Le mar. 5 juin 2018 23:50, Kenneth Knowles  a écrit :
>
>> Perhaps we can also adopt a practice of making our own APIs explicitly
>> pass a Classloader when appropriate so we only have to set this when we are
>> entering code that does not have good hygiene. It might actually be nice to
>> have a lightweight static analysis to forbid bad methods in our code.
>>
>> Kenn
>>
>> On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:
>>
>>> I totally agree, but there are so many Java APIs (including ours) that
>>> messed this up so everyone lives with the same hack.
>>>
>>> On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
>>> wrote:
>>>
 It seems like a terribly fragile way to pass arguments but my tests
 pass when I wrap the JDBC path into Beam pipeline execution with that
 pattern.

 Thanks!

 Andrew

 On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:

> It is a common mistake for APIs to not include a way to specify which
> class loader to use when doing something like deserializing an instance of
> a class via the ObjectInputStream. This common issue also affects Apache
> Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
> typical Java APIs have gotten around this is to use to the thread context
> class loader (TCCL) as the way to plumb this additional attribute through.
> So Apache Beam is meant to in all places honor the TCCL if it has been set
> as most Java libraries (not all) do the same hack.
>
> In most environments the TCCL is not set and we are working with a
> single class loader. It turns out that in more complicated environments
> (like when loading a JDBC driver, or JNDI, or an application server, ...)
> this usually doesn't work without each caller knowing what class loading
> context they should be in. A common work around for most scenarios is to
> always set the TCCL to the current classes class loader like so before
> invoking any APIs that do class loading so you don't propagate the TCCL of
> the caller along since they may have set it for some other reason:
>
> ClassLoader originalClassLoader = 
> Thread.currentThread().getContextClassLoader();try {
> 
> Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
> // call some API that uses reflection without taking ClassLoader 
> param} finally {
> Thread.currentThread().setContextClassLoader(originalClassLoader);}
>
>
>
> On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud 
> wrote:
>
>> I'm having class loading issues that go away when I revert the
>> changes in our use of Class.forName added in
>> https://github.com/apache/beam/pull/4674. The problem I'm having is
>> that the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) creates an
>> isolated class loader to load our library. Things work if we call
>> Class.forName with the default class loader [getClass().getClassLoader() 
>> or
>> no argument] but not if we use the thread context class loader
>> [Thread.currentThread().getContextClassLoader() or
>> ReflectHelpers.findClassLoader()]. Why is using the default class loader
>> not the right thing to do? How can I fix this problem?
>>
>> See this integration test for an example:
>> https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44
>>
>> https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k
>>
>> Andrew
>>
>


Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-05 Thread Romain Manni-Bucau
It is actually very localised in runner code where beam should reset the
classloader when the deserialization happens and then the runner owns the
classloader all the way in evaluators.

If IO change the classloader they must indeed handle it too and patch the
deserialization too.

Here again (we mentionned it multiple times in other threads) beam misses a
global singleton registry where you can register a "service" to look it up
based of a serialization configuration and a lifecycle allowing to close
the classloader in all instances without hacks.


Le mar. 5 juin 2018 23:50, Kenneth Knowles  a écrit :

> Perhaps we can also adopt a practice of making our own APIs explicitly
> pass a Classloader when appropriate so we only have to set this when we are
> entering code that does not have good hygiene. It might actually be nice to
> have a lightweight static analysis to forbid bad methods in our code.
>
> Kenn
>
> On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:
>
>> I totally agree, but there are so many Java APIs (including ours) that
>> messed this up so everyone lives with the same hack.
>>
>> On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
>> wrote:
>>
>>> It seems like a terribly fragile way to pass arguments but my tests
>>> pass when I wrap the JDBC path into Beam pipeline execution with that
>>> pattern.
>>>
>>> Thanks!
>>>
>>> Andrew
>>>
>>> On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:
>>>
 It is a common mistake for APIs to not include a way to specify which
 class loader to use when doing something like deserializing an instance of
 a class via the ObjectInputStream. This common issue also affects Apache
 Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
 typical Java APIs have gotten around this is to use to the thread context
 class loader (TCCL) as the way to plumb this additional attribute through.
 So Apache Beam is meant to in all places honor the TCCL if it has been set
 as most Java libraries (not all) do the same hack.

 In most environments the TCCL is not set and we are working with a
 single class loader. It turns out that in more complicated environments
 (like when loading a JDBC driver, or JNDI, or an application server, ...)
 this usually doesn't work without each caller knowing what class loading
 context they should be in. A common work around for most scenarios is to
 always set the TCCL to the current classes class loader like so before
 invoking any APIs that do class loading so you don't propagate the TCCL of
 the caller along since they may have set it for some other reason:

 ClassLoader originalClassLoader = 
 Thread.currentThread().getContextClassLoader();try {
 
 Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
 // call some API that uses reflection without taking ClassLoader 
 param} finally {
 Thread.currentThread().setContextClassLoader(originalClassLoader);}



 On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud 
 wrote:

> I'm having class loading issues that go away when I revert the changes
> in our use of Class.forName added in
> https://github.com/apache/beam/pull/4674. The problem I'm having is
> that the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) creates an
> isolated class loader to load our library. Things work if we call
> Class.forName with the default class loader [getClass().getClassLoader() 
> or
> no argument] but not if we use the thread context class loader
> [Thread.currentThread().getContextClassLoader() or
> ReflectHelpers.findClassLoader()]. Why is using the default class loader
> not the right thing to do? How can I fix this problem?
>
> See this integration test for an example:
> https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44
>
> https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k
>
> Andrew
>



Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-05 Thread Kenneth Knowles
Perhaps we can also adopt a practice of making our own APIs explicitly pass
a Classloader when appropriate so we only have to set this when we are
entering code that does not have good hygiene. It might actually be nice to
have a lightweight static analysis to forbid bad methods in our code.

Kenn

On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:

> I totally agree, but there are so many Java APIs (including ours) that
> messed this up so everyone lives with the same hack.
>
> On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud  wrote:
>
>> It seems like a terribly fragile way to pass arguments but my tests pass
>> when I wrap the JDBC path into Beam pipeline execution with that pattern.
>>
>> Thanks!
>>
>> Andrew
>>
>> On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:
>>
>>> It is a common mistake for APIs to not include a way to specify which
>>> class loader to use when doing something like deserializing an instance of
>>> a class via the ObjectInputStream. This common issue also affects Apache
>>> Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
>>> typical Java APIs have gotten around this is to use to the thread context
>>> class loader (TCCL) as the way to plumb this additional attribute through.
>>> So Apache Beam is meant to in all places honor the TCCL if it has been set
>>> as most Java libraries (not all) do the same hack.
>>>
>>> In most environments the TCCL is not set and we are working with a
>>> single class loader. It turns out that in more complicated environments
>>> (like when loading a JDBC driver, or JNDI, or an application server, ...)
>>> this usually doesn't work without each caller knowing what class loading
>>> context they should be in. A common work around for most scenarios is to
>>> always set the TCCL to the current classes class loader like so before
>>> invoking any APIs that do class loading so you don't propagate the TCCL of
>>> the caller along since they may have set it for some other reason:
>>>
>>> ClassLoader originalClassLoader = 
>>> Thread.currentThread().getContextClassLoader();try {
>>> 
>>> Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
>>> // call some API that uses reflection without taking ClassLoader param} 
>>> finally {
>>> Thread.currentThread().setContextClassLoader(originalClassLoader);}
>>>
>>>
>>>
>>> On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud 
>>> wrote:
>>>
 I'm having class loading issues that go away when I revert the changes
 in our use of Class.forName added in
 https://github.com/apache/beam/pull/4674. The problem I'm having is
 that the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) creates an
 isolated class loader to load our library. Things work if we call
 Class.forName with the default class loader [getClass().getClassLoader() or
 no argument] but not if we use the thread context class loader
 [Thread.currentThread().getContextClassLoader() or
 ReflectHelpers.findClassLoader()]. Why is using the default class loader
 not the right thing to do? How can I fix this problem?

 See this integration test for an example:
 https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44

 https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k

 Andrew

>>>


Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-04 Thread Lukasz Cwik
I totally agree, but there are so many Java APIs (including ours) that
messed this up so everyone lives with the same hack.

On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud  wrote:

> It seems like a terribly fragile way to pass arguments but my tests pass
> when I wrap the JDBC path into Beam pipeline execution with that pattern.
>
> Thanks!
>
> Andrew
>
> On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:
>
>> It is a common mistake for APIs to not include a way to specify which
>> class loader to use when doing something like deserializing an instance of
>> a class via the ObjectInputStream. This common issue also affects Apache
>> Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
>> typical Java APIs have gotten around this is to use to the thread context
>> class loader (TCCL) as the way to plumb this additional attribute through.
>> So Apache Beam is meant to in all places honor the TCCL if it has been set
>> as most Java libraries (not all) do the same hack.
>>
>> In most environments the TCCL is not set and we are working with a single
>> class loader. It turns out that in more complicated environments (like when
>> loading a JDBC driver, or JNDI, or an application server, ...) this usually
>> doesn't work without each caller knowing what class loading context they
>> should be in. A common work around for most scenarios is to always set the
>> TCCL to the current classes class loader like so before invoking any APIs
>> that do class loading so you don't propagate the TCCL of the caller along
>> since they may have set it for some other reason:
>>
>> ClassLoader originalClassLoader = 
>> Thread.currentThread().getContextClassLoader();try {
>> 
>> Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
>> // call some API that uses reflection without taking ClassLoader param} 
>> finally {
>> Thread.currentThread().setContextClassLoader(originalClassLoader);}
>>
>>
>>
>> On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud 
>> wrote:
>>
>>> I'm having class loading issues that go away when I revert the changes
>>> in our use of Class.forName added in
>>> https://github.com/apache/beam/pull/4674. The problem I'm having is
>>> that the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) creates an
>>> isolated class loader to load our library. Things work if we call
>>> Class.forName with the default class loader [getClass().getClassLoader() or
>>> no argument] but not if we use the thread context class loader
>>> [Thread.currentThread().getContextClassLoader() or
>>> ReflectHelpers.findClassLoader()]. Why is using the default class loader
>>> not the right thing to do? How can I fix this problem?
>>>
>>> See this integration test for an example:
>>> https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44
>>>
>>> https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k
>>>
>>> Andrew
>>>
>>


Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-04 Thread Andrew Pilloud
It seems like a terribly fragile way to pass arguments but my tests pass
when I wrap the JDBC path into Beam pipeline execution with that pattern.

Thanks!

Andrew

On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:

> It is a common mistake for APIs to not include a way to specify which
> class loader to use when doing something like deserializing an instance of
> a class via the ObjectInputStream. This common issue also affects Apache
> Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
> typical Java APIs have gotten around this is to use to the thread context
> class loader (TCCL) as the way to plumb this additional attribute through.
> So Apache Beam is meant to in all places honor the TCCL if it has been set
> as most Java libraries (not all) do the same hack.
>
> In most environments the TCCL is not set and we are working with a single
> class loader. It turns out that in more complicated environments (like when
> loading a JDBC driver, or JNDI, or an application server, ...) this usually
> doesn't work without each caller knowing what class loading context they
> should be in. A common work around for most scenarios is to always set the
> TCCL to the current classes class loader like so before invoking any APIs
> that do class loading so you don't propagate the TCCL of the caller along
> since they may have set it for some other reason:
>
> ClassLoader originalClassLoader = 
> Thread.currentThread().getContextClassLoader();try {
> Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
> // call some API that uses reflection without taking ClassLoader param} 
> finally {
> Thread.currentThread().setContextClassLoader(originalClassLoader);}
>
>
>
> On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud  wrote:
>
>> I'm having class loading issues that go away when I revert the changes in
>> our use of Class.forName added in
>> https://github.com/apache/beam/pull/4674. The problem I'm having is that
>> the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) creates an isolated
>> class loader to load our library. Things work if we call Class.forName with
>> the default class loader [getClass().getClassLoader() or no argument] but
>> not if we use the thread context class loader
>> [Thread.currentThread().getContextClassLoader() or
>> ReflectHelpers.findClassLoader()]. Why is using the default class loader
>> not the right thing to do? How can I fix this problem?
>>
>> See this integration test for an example:
>> https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44
>> https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k
>>
>> Andrew
>>
>


Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-04 Thread Lukasz Cwik
It is a common mistake for APIs to not include a way to specify which class
loader to use when doing something like deserializing an instance of a
class via the ObjectInputStream. This common issue also affects Apache Beam
(SerializableCoder, PipelineOptionsFactory, ...) and the way that typical
Java APIs have gotten around this is to use to the thread context class
loader (TCCL) as the way to plumb this additional attribute through. So
Apache Beam is meant to in all places honor the TCCL if it has been set as
most Java libraries (not all) do the same hack.

In most environments the TCCL is not set and we are working with a single
class loader. It turns out that in more complicated environments (like when
loading a JDBC driver, or JNDI, or an application server, ...) this usually
doesn't work without each caller knowing what class loading context they
should be in. A common work around for most scenarios is to always set the
TCCL to the current classes class loader like so before invoking any APIs
that do class loading so you don't propagate the TCCL of the caller along
since they may have set it for some other reason:

ClassLoader originalClassLoader =
Thread.currentThread().getContextClassLoader();try {
Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
// call some API that uses reflection without taking ClassLoader
param} finally {
Thread.currentThread().setContextClassLoader(originalClassLoader);}



On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud  wrote:

> I'm having class loading issues that go away when I revert the changes in
> our use of Class.forName added in https://github.com/apache/beam/pull/4674.
> The problem I'm having is that the typical JDBC GUI
> (SqlWorkbench/J, SQuirreL SQL) creates an isolated class loader to load our
> library. Things work if we call Class.forName with the default class loader
> [getClass().getClassLoader() or no argument] but not if we use the thread
> context class loader [Thread.currentThread().getContextClassLoader() or
> ReflectHelpers.findClassLoader()]. Why is using the default class loader
> not the right thing to do? How can I fix this problem?
>
> See this integration test for an example:
> https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44
> https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k
>
> Andrew
>


Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-04 Thread Andrew Pilloud
I'm having class loading issues that go away when I revert the changes in
our use of Class.forName added in https://github.com/apache/beam/pull/4674.
The problem I'm having is that the typical JDBC GUI
(SqlWorkbench/J, SQuirreL SQL) creates an isolated class loader to load our
library. Things work if we call Class.forName with the default class loader
[getClass().getClassLoader() or no argument] but not if we use the thread
context class loader [Thread.currentThread().getContextClassLoader() or
ReflectHelpers.findClassLoader()]. Why is using the default class loader
not the right thing to do? How can I fix this problem?

See this integration test for an example:
https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44
https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k

Andrew