Unbounded sources like KafkaIO, KinesisIO, etc. are also supported, in
addition to PubSub. An unbounded source has an API to report backlog. (We
will update the documentation.)
FileIO.match() is a SplittableDoFn (this is the new interface for IO). It
has a way to report backlog, but it is not clear how well defined or tested
that is.

That aside, Dataflow scales based on messages buffered between the stages
in the pipeline, not just on the backlog reported by sources. This is a
less reliable signal, but it might help in your case. It is likely that most
of your processing happens right after reading from the files. Adding a
shuffle between those two could help, i.e.:
From: Read() --> Process() --> ...  [ Here, Read() and Process() might be
fused and run inline in the same stage ]
To:   Read() --> Reshuffle --> Process() --> ...

This is strictly a workaround, but it might be good enough in practice.
Reshuffle can be used to shuffle randomly or to a fixed number of shards. It
might sometimes be better to limit it to, say, 100 shards.
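
A rough sketch of the "To" shape with a fixed number of shards, assuming the
Beam Java SDK with the GCP IO module on the classpath. The class name
ReshuffleSketch, the ProcessFn body, the subscription path and the NUM_SHARDS
constant are all made up for illustration, so treat this as a starting point
rather than a tested pipeline:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleSketch {
  // Assumed shard count; 100 is only the figure suggested above.
  static final int NUM_SHARDS = 100;

  // Picks a random shard in [0, NUM_SHARDS) for each element.
  static int randomShard() {
    return ThreadLocalRandom.current().nextInt(NUM_SHARDS);
  }

  // Placeholder for the expensive per-record processing (hypothetical).
  static class ProcessFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      out.output(line.trim());
    }
  }

  public static void main(String[] args) {
    Pipeline p =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Read(): each Pub/Sub message carries a file path or glob (hypothetical subscription).
    PCollection<String> lines =
        p.apply(PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/file-paths"))
            .apply(FileIO.matchAll())
            .apply(FileIO.readMatches())
            .apply(TextIO.readFiles());

    // Reshuffle: key by a random shard, shuffle, then drop the key again.
    // This breaks fusion, so records get buffered between Read() and Process().
    lines
        .apply(WithKeys.of((String line) -> randomShard())
            .withKeyType(TypeDescriptors.integers()))
        .apply(Reshuffle.of())
        .apply(Values.create())
        // Process(): now runs in its own stage.
        .apply(ParDo.of(new ProcessFn()));

    p.run();
  }
}
```

If you do not need to cap the number of shards, Reshuffle.viaRandomKey() can
replace the three WithKeys / Reshuffle.of() / Values steps with one transform.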

On Wed, Aug 29, 2018 at 8:37 AM [email protected] <[email protected]>
wrote:

> Excerpt for Autoscaling on Streaming mode
> "Currently, PubsubIO is the only source that supports autoscaling on
> streaming pipelines. All SDK-provided sinks are supported. In this Beta
> release, Autoscaling works smoothest when reading from Cloud Pub/Sub
> subscriptions tied to topics published with small batches and when writing
> to sinks with low latency. In extreme cases (i.e. Cloud Pub/Sub
> subscriptions with large publishing batches or sinks with very high
> latency), autoscaling is known to become coarse-grained. This will be
> improved in future releases."
>
> For our use case we cannot write messages to PubSub, so we are writing
> file paths to PubSub instead. After reading a file path we want to do
> FileIO.match and go from there. But for this use case Autoscaling is not
> kicking in in streaming mode.
>
> Is there a way to override the 'Backlog' (the hint for autoscaling to kick
> in)? In this case one message in PubSub stands for millions of records, so
> we would like Autoscaling to treat one message in PubSub as equal to
> 1 million pending messages. If it sees 100 messages in PubSub, it would
> then know it has to process the 100 million records corresponding to those
> 100 messages, and Autoscaling would start kicking in.
>
> Or maybe I am thinking along completely the wrong lines. Basically, we are
> looking for any way to trigger Autoscaling in streaming mode after the
> FileIO.match operation.
>
> Thanks
> Aniruddh
>
