Thanks, Eddie.

Just to add to the discussion, I logged the following information:
Charset.defaultCharset(): US-ASCII
System.getProperty("file.encoding"): ANSI_X3.4-1968
OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream()); writer.getEncoding(): ASCII
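For anyone who wants to reproduce this on their own workers, here is a minimal sketch of how the three values above can be logged (the class name CharsetProbe is just an illustration, not something from our pipeline):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharsetProbe {
    public static void main(String[] args) {
        // Default charset the JVM derived from the environment at startup
        System.out.println("Charset.defaultCharset(): " + Charset.defaultCharset());
        // The system property backing that default
        System.out.println("file.encoding: " + System.getProperty("file.encoding"));
        // Encoding an OutputStreamWriter uses when none is passed explicitly
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        System.out.println("writer.getEncoding(): " + writer.getEncoding());
    }
}
```

On a worker whose locale is ANSI_X3.4-1968 this prints US-ASCII / ANSI_X3.4-1968 / ASCII, matching the values above; on a typical dev machine it will print UTF-8 instead, which is exactly why the bug only shows up in Dataflow.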

In our case, a JSON library seems to be messing things up: even at first glance I found a String.getBytes() call in its internals with no way to specify the encoding, so it silently falls back to the platform default.

I really wonder if there is any way to change this default on the Dataflow workers.
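For reference, the standard JVM mechanisms for forcing the default look like this; whether the Dataflow worker harness lets you pass either of them through is exactly the open question (note also that calling System.setProperty("file.encoding", ...) at runtime does not help, because Charset.defaultCharset() is cached at JVM startup):

```shell
# Per-invocation JVM flag (launching a jar locally is just an example)
java -Dfile.encoding=UTF-8 -jar my-pipeline.jar

# Environment variable picked up by any JVM started in that environment
export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8"
```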

Cheers

On 04.11.2019 09:58, Eddy G wrote:
Adding to what Jeff pointed out previously: I'm dealing with the
same issue writing Parquet files with the ParquetIO module in
Dataflow, and the same thing happens even when I force UTF-8 on
all String objects. It may be related to behind-the-scenes
decoding/encoding inside that module, which causes those characters
to be wrongly encoded in the output. Worth checking in case you are
doing some Parquet processing, or using any other module downstream
with similar behavior.