Based on my understanding so far, I'm targeting Dataflow with a batch
pipeline. I'm just starting to experiment with @Setup/@Teardown on the
local runner - that might work fine.
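
For reference, the per-instance client pattern I'm testing looks roughly
like this - a minimal sketch, where MyRestClient, connect(), and lookup()
are placeholders for my actual (non-serializable) REST client:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    class EnrichFn extends DoFn<KV<Integer, Integer>, String> {
      // transient: rebuilt per instance via @Setup, never serialized
      private transient MyRestClient client;

      @Setup
      public void setup() {
        // Invoked once per DoFn instance, before any bundles run on it.
        client = MyRestClient.connect("https://example.com/api");
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        KV<Integer, Integer> indices = c.element();
        c.output(client.lookup(indices.getKey(), indices.getValue()));
      }

      @Teardown
      public void teardown() {
        // Best-effort cleanup when the runner discards the instance.
        if (client != null) {
          client.close();
        }
      }
    }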

I'm intrigued by the side inputs, though. The pipeline might iterate
over 1,000,000 tuples of two integers, where each integer is an index
into a database of data and a given integer is repeated in the inputs
many times. Am I prematurely optimizing by ruling out expanding the
tuples into the full data up front, given that each value might be
expanded 100 or more times? As a side input, the expanded data might
come to ~100GB. Expanding the input itself would be significantly
bigger - with each value duplicated ~100 times, something on the order
of 10TB.

#1 How does Dataflow schedule the pipeline with a map side input - does
it wait until the whole map is collected before running the consuming
DoFn?
#2 Can the DoFn specify that it depends on only specific keys of the
side input map? Does that affect the scheduling of the DoFn?
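
To make the questions concrete, here's a minimal sketch of the shape I
have in mind - DataRecord and combine() are placeholders, and tuples /
expandedData stand in for the real PCollections:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    // expandedData: PCollection<KV<Integer, DataRecord>> built from the
    // database, keyed by index; ~100GB in aggregate.
    final PCollectionView<Map<Integer, DataRecord>> dataView =
        expandedData.apply(View.asMap());

    PCollection<String> results =
        tuples.apply(ParDo.of(new DoFn<KV<Integer, Integer>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // The whole map is available; only two keys are read here.
            Map<Integer, DataRecord> data = c.sideInput(dataView);
            DataRecord left = data.get(c.element().getKey());
            DataRecord right = data.get(c.element().getValue());
            c.output(combine(left, right));  // combine() is a placeholder
          }
        }).withSideInputs(dataView));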

Thanks for any pointers...
rdm

On Wed, Jul 5, 2017 at 4:58 PM Lukasz Cwik <[email protected]> wrote:

> That should have said:
> ~100s MiBs per window in streaming pipelines
>
> On Wed, Jul 5, 2017 at 2:58 PM, Lukasz Cwik <[email protected]> wrote:
>
>> #1, side inputs' supported sizes and performance are specific to each
>> runner. For example, I know that Dataflow supports side inputs which are
>> 1+ TiB (aggregate) in batch pipelines and ~100s MiBs per window because
>> there have been several one-off benchmarks/runs. What kinds of sizes/use
>> cases do you want to support? Some runners will do a much better job with
>> really small side inputs, while others will be better with really large
>> side inputs.
>>
>> #2, this depends on which library you're using to perform the REST calls
>> and whether it is thread safe. DoFns can be shared across multiple bundles
>> and can contain methods marked with @Setup/@Teardown which only get
>> invoked once per DoFn instance (which is relatively infrequent), so you
>> could store an instance per DoFn instead of a singleton if the REST
>> library is not thread safe.
>>
>> On Wed, Jul 5, 2017 at 2:45 PM, Randal Moore <[email protected]> wrote:
>>
>>> I have a step in my Beam pipeline that needs some data from a REST
>>> service. The data acquired from the REST service depends on the context
>>> of the data being processed and is relatively large. The REST client I
>>> am using isn't serializable - nor is it likely possible to make it so
>>> (background threads, etc.).
>>>
>>> #1 What are the practical limits on the size of side inputs (e.g., I
>>> could try to gather all the data from the REST service and provide it
>>> as a side input)?
>>>
>>> #2 Assuming that using the REST client is the better option, would a
>>> singleton instance be a safe way to instantiate the REST client?
>>>
>>> Thanks,
>>> rdm
>>>
>>
>>
>
