What is the correct way to implement a custom non-splittable file parser in Python?
My desired end state is:

1) Use Read with a file pattern (with wildcards) pointing to several XML files on remote storage (S3 or GCS).
2) Parse each file as a single element (XML cannot be processed line by line), resulting in a PCollection.
3) Combine all the PCollections together.

I've subclassed FileBasedSource, which seems to give me everything out of the box. However, I have a problem with compressed (zipped) files. The self.open_file(fname) method returns a file object. For non-compressed files I can call self.open_file(fname).read() with no arguments. For compressed files, the same call fails with a missing-argument error, and I must provide the number of bytes to read: self.open_file(fname).read(num_bytes).

Is it possible to implement a FileBasedSource that works generically for both compressed and non-compressed non-splittable files?
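One workaround I am considering is to always call read(num_bytes) in a loop until it returns nothing, since that form of the call is accepted in both cases. Below is a minimal sketch using only the standard library. The helper name read_entire_file, the chunk size, and the SizeRequiredReader class are my own illustrative assumptions (the latter only mimics a file object whose read() requires a size); none of them are part of the SDK. It also assumes that an empty result from read(num_bytes) means end of file.

```python
import gzip
import io


def read_entire_file(file_obj, chunk_size=64 * 1024):
    """Read a file-like object to the end using only read(num_bytes).

    Assumes an empty result signals end of file.
    """
    chunks = []
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:  # empty bytes: nothing left to read
            break
        chunks.append(chunk)
    return b"".join(chunks)


class SizeRequiredReader:
    """Hypothetical stand-in for a file object whose read() needs a size."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, num_bytes):  # no default: read() without a size fails
        return self._raw.read(num_bytes)


xml = b"<root><item>1</item></root>"

# Uncompressed file object.
plain = io.BytesIO(xml)

# Gzip-compressed file object that insists on a size argument.
compressed = SizeRequiredReader(
    gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(xml)))
)

# The same helper handles both cases.
assert read_entire_file(plain) == xml
assert read_entire_file(compressed, chunk_size=8) == xml
```

If this is reasonable, I would call such a helper on the object returned by self.open_file(fname) inside my FileBasedSource subclass and emit the result as one element per file, but I do not know whether this is the intended approach.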
