Hello! I’m looking for some help writing a custom transform that reads from
an unbounded source. A basic version would look something like this:

    import apache_beam as beam
    import myeventlibrary

    class _ReadEventsFn(beam.DoFn):
        def process(self, unused_element):
            subscriber = myeventlibrary.Subscriber()
            while True:
                # blocks until an event is received
                event = subscriber.get()
                yield event

    with beam.Pipeline() as pipe:
        (
            pipe
            | beam.Impulse()
            | beam.ParDo(_ReadEventsFn())
            | beam.Map(print)
        )

I have a few questions:
- When executing this in the DirectRunner with the default number of
workers (1), is it correct to assume the process would be constantly
blocking and rarely processing other parts of the pipeline?
- I noticed there is a decorator, DoFn.unbounded_per_element, which seems
useful, but it’s a bit unclear what it does. I also noticed in the Java
docs that it is an error to apply this decorator if the DoFn is not an SDF.
- I took a stab at writing this as an SDF. The incoming events are not
really something that can be read in parallel and they trickle in slowly.
In this scenario, is there anything to gain by using an SDF?
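
To make the blocking concern concrete, here is a minimal sketch without Beam
of the difference between a bare blocking get() and a bounded wait. It is
only an illustration under assumptions: queue.Queue stands in for
myeventlibrary.Subscriber, and poll_events, slow_producer, poll_timeout and
max_idle_polls are hypothetical names, not part of any Beam or
myeventlibrary API. It says nothing about how a particular runner schedules
work.

```python
import queue
import threading
import time


def poll_events(source, poll_timeout=0.1, max_idle_polls=3):
    """Yield events from `source`, stopping after several empty polls.

    Each wait is bounded by `poll_timeout`, so the caller regains control
    regularly instead of blocking indefinitely on a bare `source.get()`.
    """
    idle_polls = 0
    while idle_polls < max_idle_polls:
        try:
            event = source.get(timeout=poll_timeout)
        except queue.Empty:
            # Nothing arrived in time. Assumption: this is roughly where an
            # SDF-style reader could hand control back to the runner.
            idle_polls += 1
            continue
        idle_polls = 0
        yield event


def slow_producer(sink, items, delay=0.02):
    """Hypothetical event source: emits items slowly, one at a time."""
    for item in items:
        time.sleep(delay)
        sink.put(item)


events = queue.Queue()  # stand-in for myeventlibrary.Subscriber
producer = threading.Thread(target=slow_producer, args=(events, ["a", "b", "c"]))
producer.start()
received = list(poll_events(events))
producer.join()
print(received)
```

With a bare get() the loop would never return once the events stop, whereas
the bounded version gives up after max_idle_polls empty polls.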
Thanks!
-Sam