Hi Beamers, I'm testing streaming writes to GCS to see whether Beam will work for our needs, but I'm having trouble getting it to work.
I tried having my pipeline run similarly to the one described in https://stackoverflow.com/questions/56960622/writetotext-is-only-writing-to-temp-files:

    (p
     | 'Read PubSub metadata' >> ReadFromPubSub(subscription=known_args.input_subscription)
     | 'Convert Message to JSON' >> beam.Map(lambda message: json.loads(message))
     | 'Extract File Names' >> beam.ParDo(ExtractFn())
     | 'Read Files' >> beam.io.ReadAllFromText()
     | 'Window' >> beam.WindowInto(
           beam.window.FixedWindows(30),
           trigger=beam.transforms.trigger.AfterProcessingTime(),
           accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
     | 'Combine' >> beam.transforms.core.CombineGlobally(CombineFn()).without_defaults()
     | 'Output' >> beam.io.WriteToText(known_args.output))

But, just like the Stack Overflow poster, my files never get written to the output location in GCS. (There are temp files with data, but that looks like it might be all of the input data, not the output data.)

There's an answer there saying: "From engaging with the Apache Beam Python guys, streaming writes to GCS (or local filesystem) is not yet supported in Python, hence why the streaming write does not occur; only unbounded targets are currently supported (e.g. Big Query tables). Apparently this will be supported in the upcoming release of Beam for Python v2.14.0."

I'm on 2.37.0. Was support for streaming file writes ever added?

Thanks!
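For what it's worth, if I'm reading the docs right, fileio.WriteToFiles is the file sink intended to work in streaming pipelines, so this is a minimal sketch of what I'd try next. The subscription, bucket path, and file naming below are placeholders rather than our real values, and I may be misunderstanding the intended usage:

    import json

    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.io.gcp.pubsub import ReadFromPubSub
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read PubSub' >> ReadFromPubSub(
               subscription='projects/my-project/subscriptions/my-sub')  # placeholder
         | 'Convert Message to JSON' >> beam.Map(lambda message: json.loads(message))
         # fileio.WriteToFiles needs non-global windows in a streaming pipeline
         | 'Window' >> beam.WindowInto(beam.window.FixedWindows(30))
         # TextSink writes each element as a utf-8 line, so re-serialize to str first
         | 'To Line' >> beam.Map(json.dumps)
         | 'Write' >> fileio.WriteToFiles(
               path='gs://my-bucket/output/',  # placeholder
               sink=lambda dest: fileio.TextSink(),
               shards=1,
               file_naming=fileio.default_file_naming(prefix='output', suffix='.txt')))

Is something along these lines the recommended approach now, or should WriteToText itself work in streaming mode on 2.37.0?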
