Re: Strange 'gzip error' running Beam on Dataflow
The files have no Content-Encoding set. They are not BigQuery exports; they are crafted by a service of mine. Note that my DoFn gets called for each line of the file, which is something I wouldn't expect if the file were being treated as gzipped - wouldn't gunzip be applied to the whole content?
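For what it's worth, the decompression does apply to the whole stream rather than per line: with a ".gz" object name the reader wraps the entire download in a gunzip step, and the gzip header check fails on the very first bytes if the content arrives already decompressed. Below is a minimal, self-contained sketch (plain JDK, no Beam or GCS involved; the JSON line is made up) that reproduces the exact exception from the job:

// Not the job's code - just a demo of the failure mode: gunzipping bytes
// that are already plain text throws the same "Not in GZIP format" error.
import java.io.ByteArrayInputStream
import java.util.zip.{GZIPInputStream, ZipException}

object NotInGzipFormatDemo {
  def main(args: Array[String]): Unit = {
    val alreadyPlainText = """{"msg":"hello"}""".getBytes("UTF-8")
    try {
      // GZIPInputStream checks the gzip magic bytes up front, so the whole
      // stream is rejected immediately - no line ever reaches a parser.
      new GZIPInputStream(new ByteArrayInputStream(alreadyPlainText)).read()
    } catch {
      case e: ZipException => println(e) // java.util.zip.ZipException: Not in GZIP format
    }
  }
}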
Re: Strange 'gzip error' running Beam on Dataflow
Hi Randal,

You might be experiencing automatic decompressive transcoding from GCS. Take a look at this to see if it helps: https://cloud.google.com/storage/docs/transcoding

It seems like a compressed file is expected (given the .gz extension), but the file is returned decompressed by GCS.

Any chance these files in GCS are exported from BigQuery? I started to "suffer" a similar issue because exports from BQ tables to GCS started setting new metadata (Content-Encoding: gzip, Content-Type: text/csv) on the output files and, as a consequence, the GZIP files were automatically decompressed when downloading them (as explained in the link above).

Best,
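To check whether transcoding is in play, here is a minimal sketch (assuming the google-cloud-storage Java client is on the classpath and default credentials are configured; the bucket and object names are placeholders) that prints the two metadata fields that control it:

import com.google.cloud.storage.{BlobId, StorageOptions}

object CheckTranscodingMetadata {
  def main(args: Array[String]): Unit = {
    val storage = StorageOptions.getDefaultInstance.getService
    // Placeholder bucket/object - substitute one of the real input files.
    val blob = storage.get(BlobId.of("my-bucket", "exports/part-000.json.gz"))

    // An object stored with Content-Encoding: gzip is decompressed by GCS on
    // download (decompressive transcoding) unless the client opts out.
    println(s"contentType     = ${blob.getContentType}")
    println(s"contentEncoding = ${blob.getContentEncoding}")
  }
}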
Strange 'gzip error' running Beam on Dataflow
Using Beam Java SDK 2.6.

I have a batch pipeline that has run successfully in its current form several times. Suddenly I am getting strange errors complaining about the format of the input. As far as I know, the pipeline didn't change at all since the last successful run. The error:

java.util.zip.ZipException: Not in GZIP format - Trace: org.apache.beam.sdk.util.UserCodeException

indicates that something somewhere thinks the line of text is supposed to be gzipped. I don't know what is setting that expectation, nor what code thinks the data is gzipped.

The pipeline uses TextIO to read from a Google Cloud Storage bucket. The content of the bucket object is individual "text" lines (actually each line is JSON encoded). The error occurs in the first DoFn following the TextIO read - the one that converts each string to a value object.

My log message in the exception handler shows the exact text for the string that I am expecting. I tried logging the call stack to see where the GZIP exception is thrown - it turns out to be a bit hard to follow (with a bunch of Dataflow classes called at the line in the processElement method that first uses the string).

Things I have tried:
- Changing the lines to pure text, like "hello" and "world", gets to the JSON parser, which throws an error (since it isn't JSON any more).
- If I base64 encode the lines, I [still] get the GZIP exception.
- I was running an older version of Beam, so I upgraded to 2.6. That didn't help.
- The bucket object uses application/octet-encoding.
- I tried changing the read from the bucket from the default to explicitly uncompressed:
  TextIO.read.from(job.inputsPath).withCompression(Compression.UNCOMPRESSED)

One other detail is that most of the code is written in Scala, even though it uses the Java SDK for Beam.

Any help appreciated!
rdm
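For anyone reading along, a rough sketch of the setup being described (not the actual pipeline - the path, parser, and transform names are placeholders) with the compression made explicit and the first DoFn logging any element it cannot handle, which helps separate read-side gunzip failures from parse-side failures:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.{Compression, TextIO}
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{DoFn, ParDo}
import org.slf4j.LoggerFactory

object IsolateGzipError {
  private val log = LoggerFactory.getLogger(getClass)

  // Stand-in for the real string-to-value-object conversion.
  class ParseLine extends DoFn[String, String] {
    @ProcessElement
    def processElement(c: DoFn[String, String]#ProcessContext): Unit = {
      val line = c.element()
      try {
        c.output(line.trim) // placeholder for the real JSON parsing
      } catch {
        case e: Exception =>
          log.error(s"Failed to parse line: ${line.take(200)}", e)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).create()
    val pipeline = Pipeline.create(options)

    pipeline
      .apply("ReadLines",
        TextIO.read()
          .from("gs://my-bucket/inputs/*.json")         // placeholder path
          .withCompression(Compression.UNCOMPRESSED))   // default is AUTO (decided by extension)
      .apply("ParseLines", ParDo.of(new ParseLine))

    pipeline.run()
  }
}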