On Mon, Jul 15, 2019 at 5:42 AM Chamikara Jayalath <chamik...@google.com> wrote:
>
> On Sat, Jul 13, 2019 at 7:41 PM Chad Dombrova <chad...@gmail.com> wrote:
>>
>> Hi Chamikara,
>>>>
>>>> why not make this part of the pipeline options?  does it really need to 
>>>> vary from transform to transform?
>>>
>>> It's possible for the same pipeline to connect to multiple expansion 
>>> services, to use transforms from more than one SDK language and/or version.
>>
>> There are only 3 languages supported, so excluding the user’s chosen language 
>> we’re only talking about 2 options (i.e. for python, they’re java and go). 
>> The reality is that Java provides the superset of all the IO functionality 
>> of Go and Python, and the addition of external transforms is only going to 
>> discourage the addition of more native IO transforms in python and go (which 
>> is ultimately a good thing!). So it seems like a poor UX choice to make 
>> users provide the expansion service to every single external IO transform 
>> when the reality is that 99.9% of the time it’ll be the same url for any 
>> given pipeline. Correct me if I’m wrong, but in a production scenario the 
>> expansion service would not be the current default, localhost:8097, correct? 
>> That means users would need to always specify this arg.
>>
>> Here’s an alternate proposal: instead of providing the expansion service as 
>> a URL in a transform’s __init__ arg, i.e. 
>> expansion_service='localhost:8097', make it a symbolic name, like 
>> expansion_service='java' (an external transform is bound to a particular 
>> source SDK, e.g. KafkaIO is bound to Java, so this default seems reasonable 
>> to me). Then provide a pipeline option to specify the url of an expansion 
>> service alias in the form alias@url (e.g. 
>> --expansion_service=java@myexpansionservice:8097).
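>>
>> A rough sketch of what that could look like (everything below is 
>> illustrative; none of it exists today):
>>
>>     from apache_beam.io.external.kafka import ReadFromKafka
>>
>>     # The transform defaults to a symbolic alias rather than a URL.
>>     records = p | ReadFromKafka(
>>         consumer_config={'bootstrap.servers': 'localhost:9092'},
>>         topics=['my_topic'],
>>         expansion_service='java')  # an alias, not a URL
>>
>>     # The user binds the alias to a concrete service per pipeline:
>>     #   --expansion_service=java@myexpansionservice:8097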

I like the idea of providing symbolic references, though there's a
question of how they get resolved. Something like "java" alone would
likely be insufficient, as that would require a single "java" service to
build in every IO and library. That may not even be possible, due to
diamond dependency problems, and even if it were, it could be very
wasteful; still, a reasonable subset could be a big win. The lifetime of
these services is another as-yet-unresolved question, and as you point
out this is very much a UX question as well.

Really, what I think is missing is the notion of "service (aka
cross-language library) as a release artifact." This will consist of a
global, versioned identification of the code (and supporting data)
itself (e.g. a jar or docker image) plus some parameters for how to
start it up (hopefully minimal if we have some good conventions).
We're tackling the same issue with workers (environment
specifications), runners (I've been working on
https://github.com/apache/beam/pull/9043) and now expansion services.
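To make that concrete, here's a purely illustrative sketch of what such
an artifact description might contain (none of these classes exist in
Beam today; all names and values are invented):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ExpansionServiceArtifact:
        # Global, versioned identification of the code (and supporting data).
        identifier: str = 'beam:expansion:kafka'              # invented name
        version: str = '2.14.0'                               # invented version
        payload: str = 'apache/beam_kafka_expansion:2.14.0'   # jar or docker image
        # Hopefully minimal, convention-driven startup parameters.
        start_args: List[str] = field(default_factory=list)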

This is starting to sound more like a topic of discussion for the dev
list. Added.

> Right, I was pointing out that this is a per-transform option, not a 
> per-pipeline option. Also, the exact interface of a transform's __init__ is 
> up to transform authors, so I don't think the Beam SDK can/should dictate 
> that all external transforms use this notation. This would also require all 
> external transform implementations to internally consult a pre-specified 
> pipeline option when construction parameters (for example, 
> "expansion_service") have certain values (for example, "java"). This again 
> cannot be enforced.
>>>
>>> Are you talking about the key/value coders of the Kafka external transform?
>>> The story of coders is a bit complicated for cross-language transforms. Even 
>>> if we get a bytestring from Java, how can we make sure that it is processable 
>>> in Python? For example, it might be a serialized Java object.
>>
>> IIUC, it’s not as if you support that with the current design, do you? If 
>> it’s a Java object that your native IO transform decodes in Java, then how 
>> are you going to get that to Python? Presumably the reason it’s encoded as a 
>> Java object is because it can’t be represented using a cross-language coder.
>
>
> This was my point. There's no easy way to map bytes produced by an external 
> transform to a Python coder, so we have to go through standard coders (or 
> common formats) at SDK boundaries until we have portable schema support.
>
>> On the other hand, if I’m authoring a beam pipeline in python using an 
>> external transform like PubSubIO, then it’s desirable for me to write a 
>> pickled python object to WriteToPubSub and get that back in a ReadFromPubSub 
>> in another python-based pipeline. In other words, when it comes to coders, 
>> it seems we should be favoring the language that is using the external 
>> transform, rather than the native language of the transform itself.
>
> I see. Note that bytes is one of the standard coders, so if you generate 
> bytes in the Python SDK and write those bytes to PubSub using a 
> cross-language PubSub sink transform, that will work.
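>
> For instance (pickle is just for illustration, the topic is made up, and 
> WriteToPubSub stands in for a hypothetical cross-language sink that 
> accepts bytes):
>
>     import pickle
>     import apache_beam as beam
>
>     with beam.Pipeline() as p:
>         (p
>          | beam.Create([{'user': 'chad'}, {'user': 'cham'}])
>          # Encode Python objects to bytes on the Python side, so elements
>          # cross the SDK boundary using the standard bytes coder.
>          | beam.Map(pickle.dumps)
>          | beam.io.WriteToPubSub(topic='projects/my-project/topics/my-topic'))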
>
>>
>> All of that said, it occurs to me that for ReadFromPubSub, we do explicit 
>> decoding in a subsequent transform rather than as part of ReadFromPubSub, so 
>> I’m confused why ReadFromKafka needs to know about coders at all. Is that 
>> behavior specific to Kafka?
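>>
>> That is, today one writes roughly (subscription name made up):
>>
>>     import pickle
>>     import apache_beam as beam
>>     from apache_beam.io import ReadFromPubSub
>>
>>     raw = p | ReadFromPubSub(
>>         subscription='projects/my-project/subscriptions/my-sub')
>>     decoded = raw | beam.Map(pickle.loads)  # decoding is a downstream step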
>
> Assuming you are talking about key/value deserializers, yes, this is Kafka 
> specific.
> https://github.com/apache/beam/blob/077223e55b4f73c7dda241e0b6c0322a2cb33e84/sdks/python/apache_beam/io/external/kafka.py#L133
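>
> For reference, the deserializers are passed as Java class names and 
> instantiated on the Java side during expansion; a call looks roughly like 
> this (broker and topic are illustrative):
>
>     from apache_beam.io.external.kafka import ReadFromKafka
>
>     records = p | ReadFromKafka(
>         consumer_config={'bootstrap.servers': 'localhost:9092'},
>         topics=['my_topic'],
>         # Java classes, resolved by the Java KafkaIO transform.
>         key_deserializer='org.apache.kafka.common.serialization.ByteArrayDeserializer',
>         value_deserializer='org.apache.kafka.common.serialization.ByteArrayDeserializer')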
>
>>>
>>>
>>> This is great and contributions are welcome. BTW Max and others, do you 
>>> think it will help to add an expanded roadmap on cross-language transforms 
>>> to [3] that will better describe the current status and future roadmap of 
>>> cross-language transform support for various SDKs and runners ?
>>
>> More info would be great. I’ve started looking at the changes required to 
>> make KafkaIO work as an external transform and I have a number of questions 
>> already. I’ll probably start asking questions on specific lines of this old 
>> PR unless you’d like me to use a different forum.
>
>
> Please use the dev list instead of commenting on the submitted PR so that it 
> gets the attention of the interested parties. Once you start a new PR, the 
> review process can happen there and be summarized on the dev list as needed.
>
> Thanks,
> Cham
>
>>
>> thanks,
>> -chad
