Using Beam Java SDK 2.6.

I have a batch pipeline that has run successfully in its current form several
times. Suddenly I am getting strange errors complaining about the format of
the input. As far as I know, the pipeline didn't change at all since the
last successful run. The error:
java.util.zip.ZipException: Not in GZIP format - Trace:
org.apache.beam.sdk.util.UserCodeException
indicates that something somewhere expects the line of text to be gzipped. I
don't know what is setting that expectation, nor which code is trying to
decompress it.

The pipeline uses TextIO to read from a Google Cloud Storage bucket. The
content of the bucket object is individual "text" lines (actually each line
is JSON encoded). This error occurs in the first DoFn following the TextIO
read - the one that converts each string to a value object.
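
For context, here is a minimal sketch of that part of the pipeline. The names
(ParseJsonFn, MyRecord, the gs:// path) are placeholders rather than the real
ones, and the JSON decoding is stubbed out:

    import org.apache.beam.sdk.Pipeline
    import org.apache.beam.sdk.coders.SerializableCoder
    import org.apache.beam.sdk.io.TextIO
    import org.apache.beam.sdk.options.PipelineOptionsFactory
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement
    import org.apache.beam.sdk.transforms.{DoFn, ParDo}
    import org.slf4j.LoggerFactory

    // Placeholder for the real value object each JSON line is converted to.
    case class MyRecord(id: String, payload: String)

    // First DoFn after the TextIO read: String -> value object.
    class ParseJsonFn extends DoFn[String, MyRecord] {
      @transient private lazy val log = LoggerFactory.getLogger(getClass)

      @ProcessElement
      def processElement(ctx: DoFn[String, MyRecord]#ProcessContext): Unit = {
        val line = ctx.element() // the ZipException surfaces at the first use of this string
        try {
          ctx.output(parseLine(line))
        } catch {
          case e: Exception =>
            // exception handler whose log message shows the expected text
            log.error(s"Failed to parse line: $line", e)
        }
      }

      // Stand-in for the real JSON decoding (Jackson/circe/etc. in the actual job).
      private def parseLine(line: String): MyRecord =
        MyRecord(id = line.hashCode.toString, payload = line)
    }

    object ReadAndParse {
      def main(args: Array[String]): Unit = {
        val options = PipelineOptionsFactory.fromArgs(args: _*).create()
        val pipeline = Pipeline.create(options)
        pipeline
          .apply("ReadLines", TextIO.read().from("gs://my-bucket/lines")) // placeholder path
          .apply("ParseJson", ParDo.of(new ParseJsonFn))
          .setCoder(SerializableCoder.of(classOf[MyRecord]))
        pipeline.run().waitUntilFinish()
      }
    }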

My log message in the exception handler shows the exact text of the string
that I am expecting. I tried logging the call stack to see where the GZIP
exception is thrown - it turns out to be a bit hard to follow (with a bunch
of Dataflow classes invoked at the line in the processElement method that
first uses the string).
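
A simplified sketch of what I mean by logging the call stack (the logger name
and the 10-frame cutoff are arbitrary; it just walks the cause chain so the
frame that throws the ZipException is visible even when it arrives wrapped in
a UserCodeException):

    import org.slf4j.LoggerFactory

    object StackTraceLogging {
      private val log = LoggerFactory.getLogger("ParseJsonFn")

      // Walk the cause chain so the frame that actually throws the ZipException
      // shows up even when it is wrapped in a UserCodeException.
      def logWithCauses(line: String, t: Throwable): Unit = {
        var cause: Throwable = t
        while (cause != null) {
          log.error(s"While parsing [$line]: ${cause.getClass.getName}: ${cause.getMessage}")
          cause.getStackTrace.take(10).foreach(frame => log.error(s"    at $frame"))
          cause = cause.getCause
        }
      }
    }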


   - Changing the lines to pure text, like "hello" and "world", gets to the
   JSON parser, which throws an error (since it isn't JSON any more).
   - If I base64 encode the lines, I [still] get the GZIP exception.
   - I was running an older version of Beam, so I upgraded to 2.6. Didn't
   help.
   - The bucket object uses *application/octet-encoding*
   - Tried changing the read from the bucket from the default to explicitly
   use uncompressed (full snippet in context just after this list):
   TextIO.read.from(job.inputsPath).withCompression(Compression.UNCOMPRESSED)
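
Here is that read sketched in isolation, with a placeholder path standing in
for job.inputsPath. For reference, the default when withCompression is not
called is Compression.AUTO, which picks the compression from the file name
extension:

    import org.apache.beam.sdk.io.{Compression, TextIO}

    object UncompressedRead {
      // "gs://my-bucket/lines" stands in for job.inputsPath.
      val read: TextIO.Read =
        TextIO.read()
          .from("gs://my-bucket/lines")
          .withCompression(Compression.UNCOMPRESSED)
    }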

One other detail is that most of the code is written in Scala, even though it
uses the Java SDK for Beam.

Any help appreciated!
rdm
