I ran into exactly this problem: some accented characters were being replaced with "?" in a pipeline, but only when running on Dataflow, not when using the Direct Runner. KafkaIO was not involved, but I'd bet the root cause is the same.
In my case, the input turned out to be properly UTF-8 encoded, and the problem was that we were calling String#getBytes() without specifying a charset. Locally, the default charset was UTF-8, but it looks like the Dataflow workers have their default charset set to something else (I suspect Windows-1252), so the strings were being encoded with that charset into the byte arrays in our PCollection and then read back as UTF-8.

We resolved the issue by combing our code for all uses of String#getBytes() and making sure we always pass in StandardCharsets.UTF_8.

On Thu, Oct 31, 2019 at 5:26 AM Leonardo Campos <[email protected]> wrote:

> Hello,
>
> Problem: Special characters such as öüä are being saved to our sinks as
> "?".
> Set up: We read from Kafka using Kafka IO, run the Pipeline with DataFlow
> Runner and save the results to BigQuery and ElasticSearch.
>
> We checked that data is being written to Kafka in UTF-8 (code check). We
> also checked that the special characters appear using
> kafka-console-consumer.
> Something else is that in a local setup, with Kafka in docker* and using
> Direct Runner, the character was correctly encoded. *the event was written
> using kafka-console-producer.
>
> Reading from Kafka:
>
> pipeline
>     .apply(
>         "ReadInput",
>         KafkaIO.<String, String>read()
>             .withBootstrapServers(...)
>             .withTopics(...)
>             .updateConsumerProperties(...) // only "group.id" and "auto.offset.reset"
>             .withValueDeserializer(StringDeserializer.class)
>             .withCreateTime(Duration.standardMinutes(10))
>             .commitOffsetsInFinalize())
>
> So, any clues on where to investigate? In the meantime I'm going to add
> more logging to the application to see if I can detect where the characters
> get "lost" in the pipeline. I will also try to write to local Kafka using a
> Java Kafka Producer, where I can be sure it is written in UTF-8.
>
> Thank you for the support.
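
For what it's worth, here is a minimal sketch of the String#getBytes() mismatch described in my reply. It is not our pipeline code: the class name CharsetRoundTrip and the roundTrip helper are made up for illustration. Since a snippet can't reliably change the JVM default charset, it passes the charset explicitly to simulate a worker whose default differs from UTF-8. ISO-8859-1 stands in for Windows-1252 here (both encode ö, ü and ä to the same single bytes), and US-ASCII is included only as another non-UTF-8 default to compare against.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {

    // Hypothetical helper: encodes with the given charset (standing in for
    // whatever the worker's default charset is) and always decodes as UTF-8,
    // which is what the reading side of our pipeline assumed.
    static String roundTrip(String input, Charset encodeWith) {
        byte[] bytes = input.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "öüä" written with Unicode escapes so the source file encoding
        // cannot affect the demonstration.
        String input = "\u00f6\u00fc\u00e4";

        // Worth logging on the worker to see what getBytes() would use
        // when no charset is passed.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Explicit UTF-8 on both sides: the text survives.
        System.out.println("UTF-8:      " + roundTrip(input, StandardCharsets.UTF_8));

        // Simulated non-UTF-8 defaults: the text is damaged.
        System.out.println("US-ASCII:   " + roundTrip(input, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + roundTrip(input, StandardCharsets.ISO_8859_1));
    }
}
```

With US-ASCII the umlauts come back as literal "?" characters, because getBytes() substitutes the charset's replacement byte for anything it cannot encode. With ISO-8859-1 the bytes are not valid UTF-8, so decoding yields U+FFFD replacement characters instead. Either way, passing StandardCharsets.UTF_8 explicitly on both sides removes the dependence on the worker's default.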
