Using Beam Java SDK 2.6.

I have a batch pipeline that has run successfully in its current form several
times. Suddenly I am getting strange errors complaining about the format of
the input. As far as I know, the pipeline didn't change at all since the
last successful run. The error:
java.util.zip.ZipException: Not in GZIP format - Trace:
org.apache.beam.sdk.util.UserCodeException
indicates that something somewhere expects the line of text to be gzipped. I
don't know what is setting that expectation, nor which code is trying to
decompress it.

The pipeline uses TextIO to read from a Google Cloud Storage bucket. The
content of the bucket object is individual "text" lines (actually each line
is JSON encoded). This error occurs in the first DoFn following the TextIO
read - the one that converts each string to a value object.
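
For context, here is a minimal sketch of that part of the pipeline. The names
(ParseJsonFn, MyRecord, the gs:// path) are placeholders rather than the real
ones, and the JSON decoding is stubbed out:

    import org.apache.beam.sdk.Pipeline
    import org.apache.beam.sdk.coders.SerializableCoder
    import org.apache.beam.sdk.io.TextIO
    import org.apache.beam.sdk.options.PipelineOptionsFactory
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement
    import org.apache.beam.sdk.transforms.{DoFn, ParDo}
    import org.slf4j.LoggerFactory

    // Placeholder for the real value object each JSON line is converted to.
    case class MyRecord(id: String, payload: String)

    // First DoFn after the TextIO read: String -> value object.
    class ParseJsonFn extends DoFn[String, MyRecord] {
      @transient private lazy val log = LoggerFactory.getLogger(getClass)

      @ProcessElement
      def processElement(ctx: DoFn[String, MyRecord]#ProcessContext): Unit = {
        val line = ctx.element() // the ZipException surfaces at the first use of this string
        try {
          ctx.output(parseLine(line))
        } catch {
          case e: Exception =>
            // exception handler whose log message shows the expected text
            log.error(s"Failed to parse line: $line", e)
        }
      }

      // Stand-in for the real JSON decoding (Jackson/circe/etc. in the actual job).
      private def parseLine(line: String): MyRecord =
        MyRecord(id = line.hashCode.toString, payload = line)
    }

    object ReadAndParse {
      def main(args: Array[String]): Unit = {
        val options = PipelineOptionsFactory.fromArgs(args: _*).create()
        val pipeline = Pipeline.create(options)
        pipeline
          .apply("ReadLines", TextIO.read().from("gs://my-bucket/lines")) // placeholder path
          .apply("ParseJson", ParDo.of(new ParseJsonFn))
          .setCoder(SerializableCoder.of(classOf[MyRecord]))
        pipeline.run().waitUntilFinish()
      }
    }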

My log message in the exception handler shows the exact text of the string
that I am expecting. I tried logging the call stack to see where the GZIP
exception is thrown - it turns out to be a bit hard to follow (with a bunch
of Dataflow classes invoked at the line in the processElement method that
first uses the string).
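
A simplified sketch of what I mean by logging the call stack (the logger name
and the 10-frame cutoff are arbitrary; it just walks the cause chain so the
frame that throws the ZipException is visible even when it arrives wrapped in
a UserCodeException):

    import org.slf4j.LoggerFactory

    object StackTraceLogging {
      private val log = LoggerFactory.getLogger("ParseJsonFn")

      // Walk the cause chain so the frame that actually throws the ZipException
      // shows up even when it is wrapped in a UserCodeException.
      def logWithCauses(line: String, t: Throwable): Unit = {
        var cause: Throwable = t
        while (cause != null) {
          log.error(s"While parsing [$line]: ${cause.getClass.getName}: ${cause.getMessage}")
          cause.getStackTrace.take(10).foreach(frame => log.error(s"    at $frame"))
          cause = cause.getCause
        }
      }
    }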


   - Changing the lines to pure text, like "hello" and "world", gets to the
   JSON parser, which throws an error (since it isn't JSON any more).
   - If I base64 encode the lines, I [still] get the GZIP exception.
   - I was running an older version of Beam, so I upgraded to 2.6. Didn't
   help.
   - The bucket object uses *application/octet-encoding*
   - Tried changing the read from the bucket from the default to explicitly
   use uncompressed (full snippet in context just after this list):
   TextIO.read.from(job.inputsPath).withCompression(Compression.UNCOMPRESSED)
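
Here is that read sketched in isolation, with a placeholder path standing in
for job.inputsPath. For reference, the default when withCompression is not
called is Compression.AUTO, which picks the compression from the file name
extension:

    import org.apache.beam.sdk.io.{Compression, TextIO}

    object UncompressedRead {
      // "gs://my-bucket/lines" stands in for job.inputsPath.
      val read: TextIO.Read =
        TextIO.read()
          .from("gs://my-bucket/lines")
          .withCompression(Compression.UNCOMPRESSED)
    }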

One other detail is that most of the code is written in Scala, even though it
uses the Java SDK for Beam.

Any help appreciated!
rdm
