Not with the current capabilities of TextIO since it only provides the contents of the file and none of the metadata.
FileIO gives you a ReadableByteChannel, so you can read the file incrementally in a loop. As you have already figured out, you will need to split on newlines and emit each line yourself. Consider wrapping the ReadableByteChannel in a BufferedReader and calling its readLine[1] method in a loop.

1: https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html#readLine--

On Mon, Aug 13, 2018 at 7:32 AM Akash Patel <[email protected]> wrote:

> Hi,
>
> I have a large amount of CSV files stored in a GCS bucket which are
> timestamped according to their file pattern, i.e.
> “gs://deathstar/2017-10-05/plans.csv”
> “gs://deathstar/2017-11-01/plans.csv”
>
> So basically I want to utilise the functionality of TextIO.read() where
> each line in every file is read into a PCollection but I also need to
> extract the timestamp from the file pattern and link it to each line (KV or
> something similar). However it doesn’t seem possible to extract this
> metadata unless I use FileIO. However the problem here is that the entire
> file is read, not split into individual lines. Is it possible to read each
> line in the specified globbed file pattern and have some parsed file
> metadata (i.e. timestamp from filepattern) linked to the respective element?
>
> Kind Regards,
> Akash
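
To make the BufferedReader suggestion above concrete, here is a minimal sketch of the per-file logic in plain Java. It is not a definitive implementation: the class name TimestampedLines, the helper names, the emit callback, and the assumption that the date is always a yyyy-MM-dd path segment are all made up for illustration, based only on the two example paths in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.function.BiConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pairs every line of one file with the date found in its path.
final class TimestampedLines {

  // Assumption: the date is a yyyy-MM-dd directory segment,
  // as in gs://deathstar/2017-10-05/plans.csv.
  private static final Pattern DATE_SEGMENT =
      Pattern.compile("/(\\d{4}-\\d{2}-\\d{2})/");

  // Extracts the date segment from a file path, or fails loudly if there is none.
  static String extractDate(String path) {
    Matcher m = DATE_SEGMENT.matcher(path);
    if (!m.find()) {
      throw new IllegalArgumentException("No date segment in path: " + path);
    }
    return m.group(1);
  }

  // Reads the channel incrementally and calls emit(date, line) once per line,
  // so the whole file never has to be held in memory.
  static void forEachLine(
      String path, ReadableByteChannel channel, BiConsumer<String, String> emit)
      throws IOException {
    String date = extractDate(path);
    try (BufferedReader reader =
        new BufferedReader(Channels.newReader(channel, "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        emit.accept(date, line);
      }
    }
  }
}
```

In a pipeline, this would presumably sit inside a DoFn applied after FileIO.match().filepattern(...) and FileIO.readMatches(), with a glob such as gs://deathstar/*/plans.csv (the glob itself is an assumption). For each FileIO.ReadableFile element, the path should be available from file.getMetadata().resourceId().toString() and the channel from file.open(), and the callback would output KV.of(date, line). Note that a file read this way is processed by a single worker and is not split the way TextIO can split it.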
