Thanks Raghu. Our objective is to parallelize the read operation itself. We read a single message from Pub/Sub, then look up all files present in that path (which may be, say, 10,000), and then we need to read those 10,000 files in parallel. Right now, by default, that parallelism is not happening when we do a FileIO.match and then a custom reader after that step.
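For reference, the fan-out described above might look roughly like the following sketch in Beam Java (the subscription name is a placeholder, and this assumes each Pub/Sub message carries a file path or glob):

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// Each Pub/Sub message carries a file path or glob. FileIO.matchAll()
// expands each pattern into the (possibly thousands of) files under it,
// and FileIO.readMatches() turns each match into a readable file handle,
// so the per-file reads can be distributed across workers.
PCollection<String> paths = pipeline.apply(
    PubsubIO.readStrings().fromSubscription(
        "projects/my-project/subscriptions/my-sub")); // placeholder name

PCollection<FileIO.ReadableFile> files = paths
    .apply(FileIO.matchAll())      // one pattern -> many MatchResult.Metadata
    .apply(FileIO.readMatches());  // Metadata -> ReadableFile
// Downstream: the custom reader ParDo runs over `files`.
```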
Thanks Aniruddh

On Wed, Aug 29, 2018 at 1:07 PM Raghu Angadi <[email protected]> wrote:

> Unbounded sources like KafkaIO, KinesisIO, etc. are also supported in
> addition to PubSub. An unbounded source has an API to report backlog. (We
> will update the documentation.) FileIO.matches() is a SplittableDoFn (this
> is the new interface for IO); there is a way to report backlog, but it is
> not clear how well it is defined or tested.
>
> That aside, Dataflow scales based on messages buffered between the stages
> in the pipeline, not just on the backlog reported by sources. This is a
> less reliable signal, but might help in your case. It is likely that most
> of your processing happens right after reading from the files. Adding a
> shuffle between those two could help, i.e.
> From: Read() --> Process() --> ...  [here, Read() and Process() might be
> fused and run inline in the same stage]
> To:   Read() --> Reshuffle --> Process() ...
>
> This is strictly a workaround, but it might be good enough in practice.
> Reshuffle can be used to shuffle randomly or to a fixed number of shards.
> It might sometimes be better to limit it to, say, 100 shards.
>
> On Wed, Aug 29, 2018 at 8:37 AM [email protected] <[email protected]>
> wrote:
>
>> Excerpt on autoscaling in streaming mode:
>> "Currently, PubsubIO is the only source that supports autoscaling on
>> streaming pipelines. All SDK-provided sinks are supported. In this Beta
>> release, Autoscaling works smoothest when reading from Cloud Pub/Sub
>> subscriptions tied to topics published with small batches and when
>> writing to sinks with low latency. In extreme cases (i.e. Cloud Pub/Sub
>> subscriptions with large publishing batches or sinks with very high
>> latency), autoscaling is known to become coarse-grained. This will be
>> improved in future releases."
>>
>> For our use case we cannot write the messages themselves to Pub/Sub;
>> instead we write file paths to Pub/Sub, and after reading a file path we
>> want to do FileIO.match and go from there.
>> But for these use cases autoscaling is not kicking in in streaming mode.
>>
>> Is there a way to override the 'backlog' (the hint for autoscaling to
>> kick in), so that one message in Pub/Sub is treated as an indicator of
>> millions of messages? For example, autoscaling would treat one Pub/Sub
>> message as equivalent to 1 million pending messages, and if it sees 100
>> messages in Pub/Sub, it knows it has to process 100 million records
>> corresponding to those 100 messages and starts scaling up.
>>
>> Or maybe I am thinking along completely the wrong lines. Basically, is
>> there any way to trigger autoscaling in streaming mode after the
>> FileIO.match operation?
>>
>> Thanks
>> Aniruddh
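The Reshuffle workaround Raghu describes might be sketched as follows in Beam Java (`ProcessFilesFn` is a placeholder for the actual processing DoFn; `files` is the PCollection of matched files):

```java
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

// Inserting a Reshuffle between the matched files and the processing step
// prevents Dataflow from fusing the two into a single stage. The elements
// buffered after the shuffle then become a signal the autoscaler can act
// on, even though the Pub/Sub backlog itself stays small.
files
    .apply(Reshuffle.viaRandomKey())        // break fusion; redistribute work
    .apply(ParDo.of(new ProcessFilesFn())); // placeholder processing DoFn
```

`Reshuffle.viaRandomKey()` redistributes elements randomly; if you want the fixed-shard variant Raghu mentions (e.g. limiting to ~100 shards), you would key the elements yourself before shuffling.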
