Hi Kevin,

Just want to clarify a little of what you mentioned. Do you mean declaring
the LoadingCache as a static variable in the subclass of DoFn? It sounds
like the proper way is to create the loading cache per DoFn and not share
it between different stages of the pipeline? And when should I configure
the LoadingCache? By overriding the startBundle method of the DoFn?

I tried passing an instance of the LoadingCache into the constructor of my
DoFn subclass and ran into a serialization issue, so I assume that's not
the proper way of doing it?
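For context, here is a minimal stdlib-only sketch of why the constructor approach can fail: if the cache (or its loading function) holds non-serializable state, the runner's attempt to serialize the DoFn throws NotSerializableException, while a static field is simply skipped. The class and field names below are invented for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch: why passing a cache through the DoFn constructor breaks serialization.
class SerializationDemo {
    // Stand-in for a cache holding non-serializable state (HTTP client, etc.).
    static class NotSerializableCache { }

    // Instance field => serialized along with the Fn => fails.
    static class BadFn implements Serializable {
        final NotSerializableCache cache;
        BadFn(NotSerializableCache cache) { this.cache = cache; }
    }

    // Static field => skipped by Java serialization => the Fn serializes fine.
    static class GoodFn implements Serializable {
        static NotSerializableCache cache;
    }

    // Returns true if the object can be Java-serialized.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```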

Thanks
Eric

On Sat, Oct 7, 2017 at 8:59 PM Kevin Peterson <[email protected]> wrote:

> I've used Guava's loading cache for a similar problem before, and it is
> serializable, so you just need to make sure your loading function is also
> serializable. The cached values aren't serialized, they would just be
> reloaded instead, so no worries there.
>
> If you want to share the cache, you'll need to make it static so you don't
> end up with an instance per thread.
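If it helps, Kevin's static-cache suggestion might look roughly like the sketch below. This is only an illustration, not tested on a real runner: `ConfigEnrichFn` and its loader are made-up names, and a plain `ConcurrentHashMap` stands in for Guava's `LoadingCache` so the snippet compiles without extra dependencies:

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the shared-cache pattern: the cache lives in a static field, so
// every Fn instance on a worker JVM shares one copy, and statics are never
// serialized with the Fn. In real code this would be a Guava LoadingCache.
class ConfigEnrichFn implements Serializable {
    // static + volatile: one lazily created cache per worker JVM.
    private static volatile ConcurrentHashMap<String, String> cache;

    // The loading function is a static method, so there is nothing to serialize.
    private static String loadConfig(String key) {
        return "config-for-" + key;  // placeholder for a REST lookup
    }

    private static Map<String, String> cache() {
        if (cache == null) {
            synchronized (ConfigEnrichFn.class) {
                if (cache == null) {
                    cache = new ConcurrentHashMap<>();
                }
            }
        }
        return cache;
    }

    // Roughly what a @ProcessElement body would do per element.
    public String process(String key) {
        return cache().computeIfAbsent(key, ConfigEnrichFn::loadConfig);
    }
}
```

With Guava, the lazy init would build the cache via `CacheBuilder` instead of `new ConcurrentHashMap<>()`, but the static/lazy structure is the same.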
>
>
> On Oct 7, 2017 6:40 PM, "Yihua Fang" <[email protected]> wrote:
>
> Hi Lukasz,
>
> Thanks a lot for the suggestion. Just one more question: if I were to use
> an external lib such as Guava for caching, is there any special attention
> that needs to be paid when using it in Beam, particularly around
> serialization, since I know runners need to serialize classes to ship them
> to workers?
>
> Eric
>
> On Thu, Oct 5, 2017 at 10:58 AM Lukasz Cwik <[email protected]> wrote:
>
>> Yes, every time you start the pipeline you need to read in all the config
>> data. You can do this by flattening a bounded source which reads the
>> current state of config data with a streaming source that gets updates and
>> then use that as a side input. On the other hand, with a few million keys,
>> using an in memory cache with your own refresh policy that contacts your
>> datastore would also likely work well.
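The "in-memory cache with your own refresh policy" idea could be sketched as below. This is a hand-rolled stand-in using only the JDK; with Guava you would normally get the same behavior from `CacheBuilder.refreshAfterWrite`. All names and the TTL choice are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A per-key cache whose entries are reloaded once older than ttlMillis, so
// config changes in the external datastore show up after at most one TTL.
class RefreshingCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAt;
        Entry(V value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;  // e.g. a REST lookup per key
    private final long ttlMillis;

    RefreshingCache(Function<K, V> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    V get(K key) {
        long now = System.currentTimeMillis();
        // Reload atomically if the entry is missing or stale.
        Entry<V> e = map.compute(key, (k, old) ->
            (old == null || now - old.loadedAt > ttlMillis)
                ? new Entry<>(loader.apply(k), now)
                : old);
        return e.value;
    }
}
```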
>>
>> On Thu, Oct 5, 2017 at 10:16 AM, Yihua Fang <[email protected]>
>> wrote:
>>
>>> I thought about streaming the updates and using a side input, but since
>>> the config data are not persisted in the data pipeline, how would the
>>> config be populated in the first place? Is the solution to populate it
>>> through streaming every time the pipeline is restarted?
>>>
>>> 1. The config data can be a few minutes to a few hours stale. There is
>>> no need to reflect changes immediately.
>>> 2. The config data shouldn't change often. It is configured by human
>>> users.
>>> 3. The config data per key should be about 10-20 key-value pairs.
>>> 4. Ideally the number of keys is in the range of a few million, but a
>>> few thousand to begin with.
>>>
>>> Thanks
>>> Eric
>>>
>>> On Thu, Oct 5, 2017 at 9:09 AM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> Can you stream the updates to the keys into the pipeline and then use
>>>> it as a side input, performing a join against your main stream that
>>>> needs the config data?
>>>> You could also use an in memory cache that periodically refreshes keys
>>>> from the external source.
>>>>
>>>> A better answer depends on:
>>>> * how stale can the config data be?
>>>> * how often does the config data change?
>>>> * how much config data do you expect to have per key?
>>>> * how many keys do you expect to have?
>>>>
>>>>
>>>> On Wed, Oct 4, 2017 at 5:41 PM, Yihua Fang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The use case is that I have an external source that stores a
>>>>> configuration for each key, accessible via RESTful APIs, and the Beam
>>>>> pipeline should use the config to process each element for each key.
>>>>> What is the best approach to inject the latest config into the
>>>>> pipeline?
>>>>>
>>>>> Thanks
>>>>> Eric
>>>>>
>>>>
>>>>
>>
--

Eric Fang

Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014


