Hi friends,

I encountered an issue with the Apache Beam Python SDK (2.43.0) recently
while using ReadFromTextWithFilename on a Google Cloud Storage (GCS) bucket
that contains roughly 95k gzip-compressed CSV files. One of the files was
truncated in transit, so the job ran for a couple of hours before failing
with an exception like "zlib.error: Error -3 while decompressing data:
incorrect header check" raised from within the apache_beam.io.filesystem
module. The exception didn't indicate the name of the truncated file, and
from looking through the SDK, I couldn't find any mechanism to handle the
exception or to return additional context that would have allowed me to
remediate the situation.

Is there an example of how to handle this situation? Ideally, the library
would return a PCollection of the filenames that encountered errors while
being read, or something similar, for further processing rather than
causing the job to crash.
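
To make the request concrete, here is a minimal sketch in plain Python
(standard library only, not Beam) of the behavior I have in mind: each file
is read inside a try/except, and files that fail to decompress are reported
by name instead of aborting the whole run. The helper names read_gzip_lines
and split_by_outcome are hypothetical and exist only for illustration. I
assume the same pattern could be applied per file inside a Beam transform,
but I have not verified that.

```python
import gzip
import zlib


def read_gzip_lines(path):
    """Read one gzip-compressed text file.

    Returns (path, lines, error). On success, error is None. On a
    decompression failure, lines is empty and error describes the problem.
    """
    try:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            lines = [line.rstrip("\n") for line in handle]
        return path, lines, None
    except (OSError, EOFError, zlib.error) as exc:
        # gzip.BadGzipFile is a subclass of OSError; a truncated stream
        # typically surfaces as EOFError, corrupt data as zlib.error.
        return path, [], f"{type(exc).__name__}: {exc}"


def split_by_outcome(paths):
    """Read every path and separate readable files from failed ones."""
    readable = {}  # path -> list of lines
    failed = []    # (path, error message) pairs
    for path in paths:
        name, lines, error = read_gzip_lines(path)
        if error is None:
            readable[name] = lines
        else:
            failed.append((name, error))
    return readable, failed
```

With something like this, the list of failed filenames could be written out
for later remediation while the readable files continue through the rest of
the job.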
