Thanks Raghu. Our objective is to parallelize the read operation itself. We read a single message from Pub/Sub, then look up all files present in that path (which may be, say, 10,000), and then we need to read those 10,000 files in parallel. Right now, by default, that parallelism is not happening when we do a FileIO.match and then a custom reader after that step.
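For reference, the fan-out described above might look roughly like the following sketch in Beam Java (the subscription name is a placeholder, and this assumes each Pub/Sub message carries a file path or glob):

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// Each Pub/Sub message carries a file path or glob. FileIO.matchAll()
// expands each pattern into the (possibly thousands of) files under it,
// and FileIO.readMatches() turns each match into a readable file handle,
// so the per-file reads can be distributed across workers.
PCollection<String> paths = pipeline.apply(
    PubsubIO.readStrings().fromSubscription(
        "projects/my-project/subscriptions/my-sub")); // placeholder name

PCollection<FileIO.ReadableFile> files = paths
    .apply(FileIO.matchAll())      // one pattern -> many MatchResult.Metadata
    .apply(FileIO.readMatches());  // Metadata -> ReadableFile
// Downstream: the custom reader ParDo runs over `files`.
```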
Thanks Aniruddh

On Wed, Aug 29, 2018 at 1:07 PM Raghu Angadi <[email protected]> wrote:

> Unbounded sources like KafkaIO, KinesisIO, etc. are also supported in
> addition to PubSub. An unbounded source has an API to report backlog. (We
> will update the documentation.) FileIO.matches() is a SplittableDoFn (this
> is the new interface for IO); there is a way to report backlog, but it is
> not clear how well it is defined or tested.
>
> That aside, Dataflow scales based on messages buffered between the stages
> in the pipeline, not just on the backlog reported by sources. This is a
> less reliable signal, but might help in your case. It is likely that most
> of your processing happens right after reading from the files. Adding a
> shuffle between those two could help, i.e.
> From: Read() --> Process() --> ...  [here, Read() and Process() might be
> fused and run inline in the same stage]
> To:   Read() --> Reshuffle --> Process() ...
>
> This is strictly a workaround, but it might be good enough in practice.
> Reshuffle can be used to shuffle randomly or to a fixed number of shards.
> It might sometimes be better to limit it to, say, 100 shards.
>
> On Wed, Aug 29, 2018 at 8:37 AM [email protected] <[email protected]>
> wrote:
>
>> Excerpt on autoscaling in streaming mode:
>> "Currently, PubsubIO is the only source that supports autoscaling on
>> streaming pipelines. All SDK-provided sinks are supported. In this Beta
>> release, Autoscaling works smoothest when reading from Cloud Pub/Sub
>> subscriptions tied to topics published with small batches and when
>> writing to sinks with low latency. In extreme cases (i.e. Cloud Pub/Sub
>> subscriptions with large publishing batches or sinks with very high
>> latency), autoscaling is known to become coarse-grained. This will be
>> improved in future releases."
>>
>> For our use case we cannot write the messages themselves to Pub/Sub;
>> instead we write file paths to Pub/Sub, and after reading a file path we
>> want to do FileIO.match and go from there.
>> But for these use cases autoscaling is not kicking in in streaming mode.
>>
>> Is there a way to override the 'backlog' (the hint for autoscaling to
>> kick in), so that one message in Pub/Sub is treated as an indicator of
>> millions of messages? For example, autoscaling would treat one Pub/Sub
>> message as equivalent to 1 million pending messages, and if it sees 100
>> messages in Pub/Sub, it knows it has to process 100 million records
>> corresponding to those 100 messages and starts scaling up.
>>
>> Or maybe I am thinking along completely the wrong lines. Basically, is
>> there any way to trigger autoscaling in streaming mode after the
>> FileIO.match operation?
>>
>> Thanks
>> Aniruddh
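The Reshuffle workaround Raghu describes might be sketched as follows in Beam Java (`ProcessFilesFn` is a placeholder for the actual processing DoFn; `files` is the PCollection of matched files):

```java
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

// Inserting a Reshuffle between the matched files and the processing step
// prevents Dataflow from fusing the two into a single stage. The elements
// buffered after the shuffle then become a signal the autoscaler can act
// on, even though the Pub/Sub backlog itself stays small.
files
    .apply(Reshuffle.viaRandomKey())        // break fusion; redistribute work
    .apply(ParDo.of(new ProcessFilesFn())); // placeholder processing DoFn
```

`Reshuffle.viaRandomKey()` redistributes elements randomly; if you want the fixed-shard variant Raghu mentions (e.g. limiting to ~100 shards), you would key the elements yourself before shuffling.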
