On Sat, Jul 13, 2019 at 7:41 PM Chad Dombrova <[email protected]> wrote:

> Hi Chamikara,
>
>
>>>    - why not make this part of the pipeline options?  does it really
>>>       need to vary from transform to transform?
>>>
>>> It's possible for the same pipeline to connect to multiple expansion
>> services, to use transforms from more than one SDK language and/or version.
>>
> There are only 3 languages supported, so excluding the user’s chosen
> language we’re only talking about 2 options (i.e. for python, they’re java
> and go). The reality is that Java provides the superset of all the IO
> functionality of Go and Python, and the addition of external transforms is
> only going to discourage the addition of more native IO transforms in
> python and go (which is ultimately a good thing!). So it seems like a poor
> UX choice to make users provide the expansion service to every single
> external IO transform when the reality is that 99.9% of the time it’ll be
> the same url for any given pipeline. Correct me if I’m wrong, but in a
> production scenario the expansion service would not be the current default,
> localhost:8097, correct? That means users would need to always specify
> this arg.
>
> Here’s an alternate proposal: instead of providing the expansion service
> as a URL in a transform’s __init__ arg, i.e.
> expansion_service='localhost:8097', make it a symbolic name, like
> expansion_service='java' (an external transform is bound to a particular
> source SDK, e.g. KafkaIO is bound to Java, so this default seems reasonable
> to me). Then provide a pipeline option to specify the url of an expansion
> service alias in the form alias@url (e.g.
> --expansion_service=java@myexpansionservice:8097).
>
Right, I was pointing out that this is a per-transform option, not a
per-pipeline option. Also, the exact interface of a transform's __init__ is
up to transform authors, so I don't think the Beam SDK can or should
dictate that all external transforms use this notation. It would also
require all external transform implementations to internally consult a
pre-specified pipeline option when construction parameters (for example,
"expansion_service") have certain values (for example, "java"). This again
cannot be enforced.
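For concreteness, the alias@url mapping in the proposal could be resolved
along these lines (a hypothetical sketch; neither the function nor the
option format is part of the Beam SDK):

```python
# Hypothetical sketch of resolving an expansion-service alias from
# pipeline options of the form alias@url; not part of the Beam SDK.
DEFAULT_URL = 'localhost:8097'

def resolve_expansion_service(option_values, alias):
    """Map --expansion_service=java@myexpansionservice:8097 style
    option values to the URL registered for the given alias."""
    mapping = {}
    for value in option_values:
        name, _, url = value.partition('@')
        mapping[name] = url
    # Fall back to the current default when the alias is unmapped.
    return mapping.get(alias, DEFAULT_URL)
```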

> Are you talking about key/value coders of the Kafka external transform?
>> The story of coders is a bit complicated for cross-language transforms.
>> Even if we get a bytestring from Java, how can we make sure that it is
>> processable in Python? For example, it might be a serialized Java object.
>>
> IIUC, it’s not as if you support that with the current design, do you? If
> it’s a Java object that your native IO transform decodes in Java, then how
> are you going to get that to Python? Presumably the reason it’s encoded as
> a Java object is because it can’t be represented using a cross-language
> coder.
>

This was my point.  There's no easy way to map bytes produced by an
external transform to a Python coder, so we have to go through standard
coders (or common formats) at SDK boundaries until we have portable schema
support.

> On the other hand, if I’m authoring a Beam pipeline in python using an
> external transform like PubSubIO, then it’s desirable for me to write a
> pickled python object to WriteToPubSub and get that back in a
> ReadFromPubSub in another python-based pipeline. In other words, when it
> comes to coders, it seems we should be favoring the language that is
> *using* the external transform, rather than the native language of the
> transform itself.
>
I see. Note that bytes is one of the standard coders. So if you generate
bytes in the Python SDK and write them to PubSub using a cross-language
PubSub sink transform, that will work.
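To illustrate the bytes path with the pickled-object use case above (a
sketch; the Map steps around a cross-language sink/source are elided):

```python
# Sketch: because bytes is a standard coder, a Python object can cross
# the SDK boundary by converting it to bytes before the external sink
# and back after the external source.
import pickle

def to_payload(record):
    # e.g. beam.Map(to_payload) before a cross-language PubSub write
    return pickle.dumps(record)

def from_payload(payload):
    # e.g. beam.Map(from_payload) after a cross-language PubSub read,
    # in another python-based pipeline
    return pickle.loads(payload)
```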


> All of that said, it occurs to me that for ReadFromPubSub, we do explicit
> decoding in a subsequent transform rather than as part of ReadFromPubSub,
> so I’m confused why ReadFromKafka needs to know about coders at all. Is
> that behavior specific to Kafka?
>
Assuming you are talking about the key/value deserializers, yes, this is
Kafka-specific:
https://github.com/apache/beam/blob/077223e55b4f73c7dda241e0b6c0322a2cb33e84/sdks/python/apache_beam/io/external/kafka.py#L133
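For context, the Kafka consumer config names the deserializer classes that
turn the raw key/value bytes into objects, which is why a cross-language
Kafka read has to know about them at all. A sketch of such a config (the
deserializer class names are standard Kafka ones; everything else is an
example value):

```python
# Illustrative Kafka consumer config; the class names are standard
# Kafka deserializers, the other values are examples only.
consumer_config = {
    'bootstrap.servers': 'localhost:9092',
    # Keeping both sides as raw bytes defers decoding to the consuming
    # SDK, which sidesteps the cross-language coder problem above.
    'key.deserializer':
        'org.apache.kafka.common.serialization.ByteArrayDeserializer',
    'value.deserializer':
        'org.apache.kafka.common.serialization.ByteArrayDeserializer',
}
```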


>
>> This is great and contributions are welcome. BTW Max and others, do you
>> think it will help to add an expanded roadmap on cross-language transforms
>> to [3] that will better describe the current status and future roadmap of
>> cross-language transform support for various SDKs and runners ?
>>
> More info would be great. I’ve started looking at the changes required to
> make KafkaIO work as an external transform and I have a number of questions
> already. I’ll probably start asking questions on specific lines of this old
> PR <https://github.com/apache/beam/pull/8251/files> unless you’d like me
> to use a different forum.
>

Please use the dev list instead of commenting on the submitted PR so that
it gets the attention of the interested parties. Once you start a new PR,
the review process can happen there and be summarized on the dev list as
needed.

Thanks,
Cham


> thanks,
> -chad
>
