There is a way to run arbitrary code at JVM startup via a JVM
initializer[1]. It is supported in the Dataflow worker and in the portable
Java worker as well.

You should be able to set system properties at that point, since Java
allows system properties to be mutated at runtime. The standard Java
runtime doesn't provide a hook for editing environment variables, so for
those you have to resort to some hackery that is JVM version dependent[2].
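
To make the system-property part concrete, here is a rough sketch using
only the standard library. The class name Utf8Startup and the property name
pipeline.charset are made up for illustration. In a real pipeline the body
of onStartup() would go into a class implementing the JvmInitializer
interface from [1], which I have left out so the snippet compiles without
Beam on the classpath. One caveat: setting file.encoding this late may not
change Charset.defaultCharset(), because the JVM typically reads it once at
startup, so the sketch reads its own property and passes the charset
explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Beam. In a pipeline, the body of
// onStartup() would live in a class implementing JvmInitializer [1].
final class Utf8Startup {
    // Hypothetical property name, used only for this illustration.
    static final String CHARSET_PROPERTY = "pipeline.charset";

    private Utf8Startup() {}

    // Intended to run once at JVM startup: system properties can be
    // mutated from here with plain System.setProperty.
    static void onStartup() {
        System.setProperty(CHARSET_PROPERTY, StandardCharsets.UTF_8.name());
    }

    // Encodes with the configured charset rather than the platform default,
    // falling back to UTF-8 if the property was never set.
    static byte[] encode(String s) {
        String name =
            System.getProperty(CHARSET_PROPERTY, StandardCharsets.UTF_8.name());
        return s.getBytes(Charset.forName(name));
    }
}
```

Pipeline code would then call Utf8Startup.encode(json) instead of
json.getBytes(), so the result no longer depends on the worker's default
charset.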

1: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java
2: https://blog.sebastian-daschner.com/entries/changing_env_java

On Fri, Aug 30, 2019 at 8:13 AM Jeff Klukas <[email protected]> wrote:

> I just spent the past two days debugging a character corruption issue in a
> Dataflow pipeline. It turned out that we had encoded a json object to a
> string and then called getBytes() without specifying a charset. In our
> testing infrastructure, this didn't cause a problem because the default
> charset on the system was UTF-8. Whatever the default charset is on
> Dataflow workers, it is apparently not UTF-8.
>
> The main lesson here is to be very careful about always specifying a
> charset when encoding and decoding strings. But, it would be nice to
> protect ourselves from this problem in the future.
>
> Is there any way for users to specify environment variables and/or Java
> system properties when deploying a pipeline to Dataflow such that those
> settings are in effect on all workers? I'd like to ensure UTF-8 is the
> default charset throughout the pipeline on any system.
>
>
