The files have no content-encoding set. They are not BigQuery exports; they are crafted by a service of mine.
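For what it's worth, here is roughly how I double-check that metadata from Scala with the google-cloud-storage client (just a sketch - the bucket and object names below are placeholders, not my real paths):

    import com.google.cloud.storage.{BlobId, StorageOptions}

    // Fetch the object's metadata and print the two fields GCS uses to
    // decide whether to apply decompressive transcoding on download.
    val storage = StorageOptions.getDefaultInstance.getService
    val blob = storage.get(BlobId.of("my-bucket", "path/to/object"))
    println(s"contentType=${blob.getContentType}")
    println(s"contentEncoding=${blob.getContentEncoding}")

contentEncoding prints null for my objects, which is why I don't think transcoding is happening on the GCS side.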
Note that my DoFn gets called for each line of the file, which I wouldn't expect if gzip were actually in play - wouldn't gunzip apply to the whole content rather than to individual lines?

On Fri, Oct 12, 2018, 5:04 PM Jose Ignacio Honrado <[email protected]> wrote:

> Hi Randal,
>
> You might be experiencing automatic decompressive transcoding from GCS.
> Take a look at this to see if it helps:
> https://cloud.google.com/storage/docs/transcoding
>
> It seems like a compressed file is expected (given the .gz extension), but
> the file is returned decompressed by GCS.
>
> Any chance these files in GCS are exported from BigQuery? I started to
> "suffer" a similar issue because exports from BQ tables to GCS started
> setting new metadata (content-encoding: gzip, content-type: text/csv) on
> the output files and, as a consequence, GZIP files were automatically
> decompressed when downloading them (as explained in the previous link).
>
> Best,
>
>
> On Fri, Oct 12, 2018 at 23:40, Randal Moore <[email protected]> wrote:
>
>> Using Beam Java SDK 2.6.
>>
>> I have a batch pipeline that has run successfully in its current form
>> several times. Suddenly I am getting strange errors complaining about the
>> format of the input. As far as I know, the pipeline didn't change at all
>> since the last successful run. The error:
>>
>> java.util.zip.ZipException: Not in GZIP format - Trace:
>> org.apache.beam.sdk.util.UserCodeException
>>
>> indicates that something somewhere thinks the line of text is supposed to
>> be gzipped. I don't know what is setting that expectation nor what code
>> thinks the data should be gzipped.
>>
>> The pipeline uses TextIO to read from a Google Cloud Storage bucket. The
>> content of the bucket object is individual "text" lines (each line is
>> actually JSON encoded). The error occurs in the first DoFn after the
>> TextIO read - the one that converts each string to a value object.
>>
>> My log message in the exception handler shows the exact text I am
>> expecting for the string. I tried logging the call stack to see where the
>> GZIP exception is thrown, but it turns out to be hard to follow (a bunch
>> of Dataflow classes are invoked at the line in the processElement method
>> that first uses the string).
>>
>> - Changing the lines to pure text, like "hello" and "world", gets to
>> the JSON parser, which throws an error (since it isn't JSON any more).
>> - If I base64-encode the lines, I still get the GZIP exception.
>> - I was running an older version of Beam, so I upgraded to 2.6. That
>> didn't help.
>> - The bucket object uses content type *application/octet-stream*.
>> - I tried changing the read from the default to explicitly uncompressed:
>> TextIO.read.from(job.inputsPath).withCompression(Compression.UNCOMPRESSED)
>>
>> One other detail: most of the code is written in Scala, even though it
>> uses the Java SDK for Beam.
>>
>> Any help appreciated!
>> rdm
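In case the BigQuery-export metadata Jose describes does turn out to be the culprit for anyone else landing on this thread: clearing content-encoding on the object should stop GCS from transcoding, so a .gz object arrives still compressed and Beam can gunzip it itself. A sketch with the same client, continuing from the blob fetched above (again, placeholder names, and untested against my own buckets since my objects have no content-encoding):

    // Clear content-encoding so GCS serves the stored bytes verbatim
    // instead of decompressing them during download.
    val updated = blob.toBuilder.setContentEncoding(null).build().update()
    println(s"contentEncoding now: ${updated.getContentEncoding}")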
