Not with the current capabilities of TextIO, since it only provides the
contents of each file and none of its metadata.

FileIO exposes the file's metadata and allows you to use a
ReadableByteChannel, so you can read the file incrementally in a loop. As
you have already figured out, you will need to split the contents into
lines and emit each one yourself. Consider wrapping the ReadableByteChannel
in a BufferedReader and calling its readLine[1] method in a while loop
until it returns null.
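
As a rough sketch of that loop, using only the JDK so it runs without Beam:
the class name, the date regex and the in-memory sample data are
illustrative assumptions. In a DoFn over FileIO.ReadableFile you would
take the channel from open() and the path from
getMetadata().resourceId().toString(), and output each pair as you read it
rather than collecting a list.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinesWithFileDate {
    // Assumption: the date is the first yyyy-MM-dd directory segment of the path.
    private static final Pattern DATE_SEGMENT =
        Pattern.compile("/(\\d{4}-\\d{2}-\\d{2})/");

    // Extracts the date segment, e.g. "2017-10-05" from
    // "gs://deathstar/2017-10-05/plans.csv".
    static String dateFromPath(String path) {
        Matcher m = DATE_SEGMENT.matcher(path);
        if (!m.find()) {
            throw new IllegalArgumentException("No date segment in path: " + path);
        }
        return m.group(1);
    }

    // Reads the channel line by line and pairs every line with the file's date.
    static List<Map.Entry<String, String>> readLines(
            String path, ReadableByteChannel channel) throws IOException {
        String date = dateFromPath(path);
        List<Map.Entry<String, String>> out = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                Channels.newReader(channel, StandardCharsets.UTF_8.name()))) {
            String line;
            // readLine() returns null once the end of the stream is reached.
            while ((line = reader.readLine()) != null) {
                out.add(new SimpleImmutableEntry<>(date, line));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the contents of one CSV file.
        byte[] csv = "id,plan\n1,basic\n2,premium\n".getBytes(StandardCharsets.UTF_8);
        ReadableByteChannel channel =
            Channels.newChannel(new ByteArrayInputStream(csv));
        for (Map.Entry<String, String> e
                : readLines("gs://deathstar/2017-10-05/plans.csv", channel)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The sketch treats the header row like any other line; whether to skip it
depends on what your files look like.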

[1]
https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html#readLine--

On Mon, Aug 13, 2018 at 7:32 AM Akash Patel <[email protected]>
wrote:

> Hi,
>
> I have a large number of CSV files stored in a GCS bucket, which are
> timestamped according to their file pattern, i.e.
> “gs://deathstar/2017-10-05/plans.csv”
> “gs://deathstar/2017-11-01/plans.csv”
>
> So basically I want to utilise the functionality of TextIO.read(), where
> each line in every file is read into a PCollection, but I also need to
> extract the timestamp from the file pattern and link it to each line (KV
> or something similar). It doesn’t seem possible to extract this metadata
> unless I use FileIO, but the problem there is that the entire file is
> read rather than split into individual lines. Is it possible to read each
> line of the files matching the specified glob pattern and have some
> parsed file metadata (i.e. the timestamp from the file pattern) linked to
> the respective element?
>
> Kind Regards,
> Akash
>
