Most people use a static singleton pattern that is lazily initialized on
first access and shared across everyone who accesses it. They don't store
it as part of the DoFn.
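
A minimal sketch of that pattern in plain Java (no Beam or Guava dependencies; the class and method names are illustrative, not from the thread):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Initialization-on-demand holder idiom: the JVM guarantees the cache is
// built exactly once, thread-safely, on the first call to get(). Because
// the cache lives in a static field, every DoFn instance on the worker
// shares it, and it is never part of the DoFn's serialized state.
class ConfigCacheHolder {
    private static class Holder {
        static final Map<String, String> CACHE = new ConcurrentHashMap<>();
    }

    static Map<String, String> get() {
        return Holder.CACHE;
    }
}
```

A real pipeline would typically put a Guava LoadingCache behind the static field rather than a bare map; the point here is only where the reference lives.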

On Thu, Oct 12, 2017 at 3:17 PM, Eric Fang <[email protected]> wrote:

> Hi Kevin,
>
> Just want to clarify a little of what you mentioned. Do you mean declaring a
> static variable (the LoadingCache) on the subclass of DoFn? It sounds like
> the proper way is to create the loading cache for each DoFn and not to
> share it between different stages of the pipeline? When should I configure
> the LoadingCache? By overriding the startBundle function of the DoFn?
>
> I tried to pass an instance variable of the LoadingCache into the
> constructor of the subclass of the DoFn and I ran into a serialization
> issue. So I assume that's not the proper way of doing it?
>
> Thanks
> Eric
>
> On Sat, Oct 7, 2017 at 8:59 PM Kevin Peterson <[email protected]> wrote:
>
>> I've used Guava's loading cache for a similar problem before, and it is
>> serializable, so you just need to make sure your loading function is also
>> serializable. The cached values aren't serialized, they would just be
>> reloaded instead, so no worries there.
>>
>> If you want to share the cache, you'll need to make it static so you
>> don't end up with an instance per thread.
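
A rough sketch of what Kevin describes, using only the JDK so it stands alone (with Guava you would build a LoadingCache via CacheBuilder instead of using a ConcurrentHashMap; ConfigDoFn, SerializableFunction, and configFor are illustrative names, and the class stands in for a real Beam DoFn subclass):

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class ConfigDoFn implements Serializable {
    // A functional interface that is both a loader and serializable, so
    // it can safely ride along when the DoFn is shipped to workers.
    interface SerializableFunction extends Function<String, String>, Serializable {}

    // static: one cache per JVM, shared across threads and DoFn
    // instances, and never part of the DoFn's serialized state.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    // Only the (serializable) loader is an instance field.
    private final SerializableFunction loader;

    ConfigDoFn(SerializableFunction loader) {
        this.loader = loader;
    }

    String configFor(String key) {
        // Values are loaded lazily on first access per worker; they are
        // never serialized, only reloaded.
        return CACHE.computeIfAbsent(key, loader);
    }
}
```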
>>
>>
>> On Oct 7, 2017 6:40 PM, "Yihua Fang" <[email protected]> wrote:
>>
>> Hi Lukasz,
>>
>> Thanks a lot for the suggestion. Just one more question: if I were to use
>> an external lib such as Guava for caching, is there any special attention
>> that needs to be paid when using it in Beam, particularly on the
>> serialization side, since I know runners need to serialize classes to ship
>> them to workers?
>>
>> Eric
>>
>> On Thu, Oct 5, 2017 at 10:58 AM Lukasz Cwik <[email protected]> wrote:
>>
>>> Yes, every time you start the pipeline you need to read in all the config
>>> data. You can do this by flattening a bounded source that reads the
>>> current state of the config data with a streaming source that gets updates,
>>> and then using that as a side input. On the other hand, with a few million
>>> keys, an in-memory cache with your own refresh policy that contacts your
>>> datastore would also likely work well.
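
The second option, an in-memory cache with its own refresh policy, might be sketched like this (RefreshingConfigCache and fetchAll are hypothetical names; the supplier stands in for the call to the external config service):

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class RefreshingConfigCache {
    // volatile: readers always see the latest published snapshot without locking.
    private volatile Map<String, String> snapshot;

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // don't keep the JVM alive for refreshes
            return t;
        });

    RefreshingConfigCache(Supplier<Map<String, String>> fetchAll, long periodSeconds) {
        this.snapshot = fetchAll.get(); // initial load on startup
        scheduler.scheduleAtFixedRate(
            () -> snapshot = fetchAll.get(), // periodic full refresh
            periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    String get(String key) {
        return snapshot.get(key);
    }
}
```

Replacing the whole snapshot each period keeps reads lock-free and tolerates config data that is a few minutes stale, which matches the requirements below.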
>>>
>>> On Thu, Oct 5, 2017 at 10:16 AM, Yihua Fang <[email protected]>
>>> wrote:
>>>
>>>> I thought about streaming the updates and using a side input, but since
>>>> the config data is not persisted in the data pipeline, how would the
>>>> config be populated in the first place? Is the solution to populate it
>>>> through streaming every time the pipeline is rebooted?
>>>>
>>>> 1. The config data can be a few minutes to a few hours late. No need to
>>>> reflect config changes immediately.
>>>> 2. The config data shouldn't change often. It is configured by human
>>>> users.
>>>> 3. The config data per key should be about 10-20 key-value pairs.
>>>> 4. Ideally the number of keys is in the range of a few million, but a few
>>>> thousand to begin with.
>>>>
>>>> Thanks
>>>> Eric
>>>>
>>>> On Thu, Oct 5, 2017 at 9:09 AM Lukasz Cwik <[email protected]> wrote:
>>>>
>>>>> Can you stream the updates to the keys into the pipeline and then use
>>>>> it as a side input, performing a join against your main stream that needs
>>>>> the config data?
>>>>> You could also use an in-memory cache that periodically refreshes keys
>>>>> from the external source.
>>>>>
>>>>> A better answer depends on:
>>>>> * how stale can the config data be?
>>>>> * how often does the config data change?
>>>>> * how much config data do you expect to have per key?
>>>>> * how many keys do you expect to have?
>>>>>
>>>>>
>>>>> On Wed, Oct 4, 2017 at 5:41 PM, Yihua Fang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The use case is that I have an external source that stores a
>>>>>> configuration for each key, accessible via RESTful APIs, and the Beam
>>>>>> pipeline should use the config to process each element for each key.
>>>>>> What is the best approach to injecting the latest config into the
>>>>>> pipeline?
>>>>>>
>>>>>> Thanks
>>>>>> Eric
>>>>>>
>>>>>
>>>>>
>>>
>> --
>
> Eric Fang
>
> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
