Using Beam Java SDK 2.6. I have a batch pipeline that has run successfully in its current form several times. Suddenly I am getting strange errors complaining about the format of the input, even though, as far as I know, the pipeline hasn't changed at all since the last successful run. The error is `java.util.zip.ZipException: Not in GZIP format` (the trace goes through `org.apache.beam.sdk.util.UserCodeException`), which indicates that something, somewhere, thinks the lines of text are supposed to be gzipped. I don't know what is setting that expectation, nor which code thinks the input should be gzipped.
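For reference, the exact exception message can be reproduced outside Beam with the plain JDK: `GZIPInputStream` throws it as soon as the first two bytes of the stream don't match the GZIP magic header. (The class name here is mine, purely for illustration.)

```java
import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipException;

public class GzipCheck {
    // Try to open the given bytes as a GZIP stream; return the failure
    // message if they are rejected, or "ok" if the header is valid.
    public static String readAsGzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return "ok";
        } catch (ZipException e) {
            return e.getMessage();
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Plain JSON bytes lack the GZIP magic header (0x1f 0x8b),
        // so the constructor rejects them immediately.
        System.out.println(readAsGzip("{\"key\": \"value\"}".getBytes()));
        // prints "Not in GZIP format"
    }
}
```

So the message is coming from something wrapping the object's bytes in a GZIP decoder before my DoFn ever sees them.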
The pipeline uses TextIO to read from a Google Cloud Storage bucket. The bucket object contains individual "text" lines (each line is actually JSON-encoded). The error occurs in the first DoFn after the TextIO read, which converts each string to a value object. My log message in the exception handler shows the exact text of the string I am expecting. I tried logging the call stack to see where the GZIP exception is thrown, but it turns out to be hard to follow (a bunch of Dataflow classes are invoked at the line in the `processElement` method that first uses the string).

Things I have tried:

- Changing the lines to pure text, like "hello" and "world". This gets as far as the JSON parser, which then throws an error (since the input isn't JSON any more).
- Base64-encoding the lines. I still get the GZIP exception.
- Upgrading from an older version of Beam to 2.6. Didn't help.
- Checking the bucket object's content type: it is *application/octet-encoding*.
- Changing the read from the bucket from the default to explicitly uncompressed: `TextIO.read.from(job.inputsPath).withCompression(Compression.UNCOMPRESSED)`

One other detail: most of the code is written in Scala, even though it uses the Java SDK for Beam.

Any help appreciated!

rdm
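Edit: my current suspicion is that the compression is being auto-detected from the object's filename extension rather than from its contents, so a plain-text object whose name merely ends in `.gz` would be fed through a GZIP decoder and fail exactly like this. A simplified, illustrative sketch of that kind of suffix-based detection (this is my own code, not Beam's actual implementation):

```java
public class CompressionGuess {
    // Mimic extension-based compression auto-detection: pick a codec
    // from the filename suffix only, ignoring the file's actual bytes.
    public static String guess(String filename) {
        String lower = filename.toLowerCase();
        if (lower.endsWith(".gz"))  return "GZIP";
        if (lower.endsWith(".bz2")) return "BZIP2";
        if (lower.endsWith(".zip")) return "ZIP";
        return "UNCOMPRESSED";
    }

    public static void main(String[] args) {
        // The same plain-text content is treated differently based on
        // nothing but the object's name.
        System.out.println(guess("gs://my-bucket/lines.json"));    // prints "UNCOMPRESSED"
        System.out.println(guess("gs://my-bucket/lines.json.gz")); // prints "GZIP"
    }
}
```

If that theory is right, renaming the bucket object (or forcing `Compression.UNCOMPRESSED`, which I tried above) should be the fix. Can anyone confirm where Beam's TextIO decides this?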