Thanks for the response, Max.  I saw that KafkaIO is now supported in
Python via an external transform (
https://jira.apache.org/jira/browse/BEAM-7029), but I thought I read
somewhere that it was only supported in batch mode (though I don't see
that mentioned in the ticket or the PR, so I'm not sure where I got that
impression).

Do I have to modify the source along the lines of the KafkaIO PRs to make
this work with PubSubIO, or is it already supported via some flag?
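For anyone following along, here is a minimal pure-Python sketch of why the
Split step blows up (the class and function names below are stand-ins I made
up for illustration, not Beam's actual implementation): the portable runner
treats the Python source like any bounded source and calls estimate_size()
on it, but the Dataflow-only _PubSubSource stub never implements that method.

```python
class BoundedSource:
    """Stand-in for apache_beam.io.iobase.BoundedSource: real sources
    override estimate_size() so the runner can decide how to split them."""

    def estimate_size(self):
        raise NotImplementedError


class PubSubSourceStub:
    """Stand-in for _PubSubSource: a placeholder that carries only the
    topic/subscription config and implements no source methods at all."""

    def __init__(self, subscription="projects/p/subscriptions/s"):
        self.subscription = subscription


def split_source(source, desired_chunks=2):
    # Mirrors what the portable runner's 'Read/Split' step does: ask the
    # source for its size before partitioning the work.
    return source.estimate_size() // desired_chunks


try:
    split_source(PubSubSourceStub())
except AttributeError as err:
    # Analogous to: '_PubSubSource' object has no attribute 'estimate_size'
    print(err)
```

So the failure is not a Flink streaming limitation per se; the runner never
gets a real source to split in the first place.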

-chad


On Sat, Jul 13, 2019 at 8:43 AM Maximilian Michels <[email protected]> wrote:

> Hi Chad,
>
> This stub will only be replaced by the Dataflow service. It's an artifact
> of the pre-portability era.
>
> That said, we now have the option to replace ReadFromPubSub with an
> external transform which would utilize Java's PubSubIO via the new
> cross-language feature.
>
> Thanks,
> Max
>
> On 12.07.19 19:32, Chad Dombrova wrote:
> > Hi all,
> > This error came as a bit of a surprise.
> >
> > Here’s a snippet of the traceback (full traceback below).
> >
> > File "apache_beam/runners/common.py", line 751, in apache_beam.runners.common.DoFnRunner.process
> >     return self.do_fn_invoker.invoke_process(windowed_value)
> > File "apache_beam/runners/common.py", line 423, in apache_beam.runners.common.SimpleInvoker.invoke_process
> >     windowed_value, self.process_method(windowed_value.value))
> > File "/Users/chad/dev/beam-tests/.venv/lib/python2.7/site-packages/apache_beam/io/iobase.py", line 860, in split_source
> > AttributeError: '_PubSubSource' object has no attribute 'estimate_size'
> > [while running 'PubSubInflow/Read/Split']
> >
> > Flink is using _PubSubSource, which is, as far as I can tell, a stub
> > that should be replaced at runtime by an actual streaming source. Is
> > this error a bug or a known limitation? If the latter, is there a Jira
> > issue and any momentum toward solving it?
> >
> > I’m pretty confused by this because the Apache Beam Portability
> > Support Matrix [1] makes it clear that Flink supports streaming, and
> > the docs for “Built-in I/O Transforms” [2] list Google PubSub and
> > BigQuery as the only I/O transforms that support streaming, so if
> > streaming works with Flink, PubSub should probably be the thing it
> > works with.
> >
> > I'm using Beam 2.13.0 and Flink 1.8.
> >
> > thanks,
> > chad
> >
> > [1]
> >
> https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0
> > [2] https://beam.apache.org/documentation/io/built-in/
> >
> > Full traceback:
> >
> > Caused by: java.util.concurrent.ExecutionException:
> > java.lang.RuntimeException: Error received from SDK harness for
> > instruction 5: Traceback (most recent call last):
> > File "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 157, in _execute
> >     response = task()
> > File "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 190, in <lambda>
> >     self._execute(lambda: worker.do_instruction(work), work)
> > File "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 333, in do_instruction
> >     request.instruction_id)
> > File "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 359, in process_bundle
> >     bundle_processor.process_bundle(instruction_id))
> > File "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 589, in process_bundle
> >     ].process_encoded(data.data)
> > File "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 143, in process_encoded
> >     self.output(decoded_value)
> > File "apache_beam/runners/worker/operations.py", line 246, in apache_beam.runners.worker.operations.Operation.output
> >     def output(self, windowed_value, output_index=0):
> > File "apache_beam/runners/worker/operations.py", line 247, in apache_beam.runners.worker.operations.Operation.output
> >     cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
> > File "apache_beam/runners/worker/operations.py", line 143, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
> >     self.consumer.process(windowed_value)
> > File "apache_beam/runners/worker/operations.py", line 583, in apache_beam.runners.worker.operations.DoOperation.process
> >     with self.scoped_process_state:
> > File "apache_beam/runners/worker/operations.py", line 584, in apache_beam.runners.worker.operations.DoOperation.process
> >     delayed_application = self.dofn_receiver.receive(o)
> > File "apache_beam/runners/common.py", line 747, in apache_beam.runners.common.DoFnRunner.receive
> >     self.process(windowed_value)
> > File "apache_beam/runners/common.py", line 753, in apache_beam.runners.common.DoFnRunner.process
> >     self._reraise_augmented(exn)
> > File "apache_beam/runners/common.py", line 807, in apache_beam.runners.common.DoFnRunner._reraise_augmented
> >     raise_with_traceback(new_exn)
> > File "apache_beam/runners/common.py", line 751, in apache_beam.runners.common.DoFnRunner.process
> >     return self.do_fn_invoker.invoke_process(windowed_value)
> > File "apache_beam/runners/common.py", line 423, in apache_beam.runners.common.SimpleInvoker.invoke_process
> >     windowed_value, self.process_method(windowed_value.value))
> > File "/Users/chad/dev/beam-tests/.venv/lib/python2.7/site-packages/apache_beam/io/iobase.py", line 860, in split_source
> > AttributeError: '_PubSubSource' object has no attribute 'estimate_size'
> > [while running 'PubSubInflow/Read/Split']
> >
>
>
