Maybe I misunderstood the setup. Can you post pseudocode for part of your pipeline (from the Pub/Sub read until your custom reader)? You might need a Reshuffle before the file read.
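Roughly something like the following (a minimal sketch, assuming the Beam Java SDK; `subscription`, `options`, and `MyCustomReaderFn` are placeholders for your actual subscription, pipeline options, and custom reader, not anything from your code):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;

    Pipeline p = Pipeline.create(options);

    p.apply("ReadFilePaths",                  // one message = one file pattern
            PubsubIO.readStrings().fromSubscription(subscription))
     .apply("MatchFiles", FileIO.matchAll())  // fans out to many matched files
     // Break fusion here so the matched files are redistributed across
     // workers before the expensive per-file read runs.
     .apply("BreakFusion", Reshuffle.viaRandomKey())
     .apply("ReadFiles", FileIO.readMatches())
     .apply("ParseRecords", ParDo.of(new MyCustomReaderFn())); // hypothetical custom reader

Without the Reshuffle, the match and the read can be fused into one stage, so a single Pub/Sub message (and its 10,000 files) stays on one worker.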
On Wed, Aug 29, 2018 at 6:18 PM Aniruddh Sharma <[email protected]> wrote:

> Thanks Raghu
>
> Our objective is to parallelize the read operation itself. We read a
> single message from Pub/Sub, then need to look up all the files present
> at that path (say, 10,000 of them) and read those 10,000 files in
> parallel. Right now, by default, that is not happening with a
> FileIO.match followed by a custom reader after that step.
>
> Thanks
> Aniruddh
>
> On Wed, Aug 29, 2018 at 1:07 PM Raghu Angadi <[email protected]> wrote:
>
>> Unbounded sources like KafkaIO, KinesisIO, etc. are also supported in
>> addition to PubSub. An unbounded source has an API to report backlog.
>> (We will update the documentation.) FileIO.match() is a SplittableDoFn
>> (the new interface for IO); there is a way to report backlog, but it is
>> not clear how well it is defined or tested.
>>
>> That aside, Dataflow scales based on the messages buffered between the
>> stages in the pipeline, not just on the backlog reported by sources.
>> This is a less reliable signal, but it might help in your case. It is
>> likely that most of your processing happens right after reading from
>> the files. Adding a shuffle between those two could help, i.e.
>> From: read() --> Process() --> ... [here, read() and Process() might
>> be fused and run inline in the same stage]
>> To:   Read() --> Reshuffle --> Process() ...
>>
>> This is strictly a workaround, but it might be good enough in practice.
>> Reshuffle can shuffle randomly or to a fixed number of shards. It is
>> sometimes better to limit it to, say, 100 shards.
>>
>> On Wed, Aug 29, 2018 at 8:37 AM [email protected]
>> <[email protected]> wrote:
>>
>>> Excerpt on Autoscaling in Streaming mode:
>>> "Currently, PubsubIO is the only source that supports autoscaling on
>>> streaming pipelines. All SDK-provided sinks are supported. In this Beta
>>> release, Autoscaling works smoothest when reading from Cloud Pub/Sub
>>> subscriptions tied to topics published with small batches and when
>>> writing to sinks with low latency. In extreme cases (i.e. Cloud Pub/Sub
>>> subscriptions with large publishing batches or sinks with very high
>>> latency), autoscaling is known to become coarse-grained. This will be
>>> improved in future releases."
>>>
>>> For our use case we cannot write the messages themselves to Pub/Sub.
>>> Instead we write file paths to Pub/Sub, and after reading a file path
>>> we want to do a FileIO.match and go from there. But for this use case,
>>> autoscaling is not kicking in in streaming mode.
>>>
>>> Is there a way to override the 'backlog' (the hint for autoscaling to
>>> kick in), so that one message in Pub/Sub is treated as an indicator of
>>> millions of messages? Autoscaling would then treat one Pub/Sub message
>>> as equal to 1 million pending messages, and if it sees 100 messages in
>>> Pub/Sub it knows it has to process the 100 million records behind them
>>> and start scaling up.
>>>
>>> Or maybe I am thinking along completely the wrong lines. Basically,
>>> is there any way to trigger autoscaling in streaming mode after a
>>> FileIO.match operation?
>>>
>>> Thanks
>>> Aniruddh
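For completeness, the fixed-shard variant of the Reshuffle workaround mentioned above might look roughly like this (a sketch, assuming the Beam Java SDK; `matches` is a hypothetical PCollection produced by FileIO.matchAll(), and the 100 shards just mirror the number suggested above):

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // 'matches' is the output of FileIO.matchAll() in the pipeline above.
    PCollection<MatchResult.Metadata> sharded =
        matches
            // Assign each matched file a random key in [0, 100) ...
            .apply("AssignShard",
                WithKeys.<Integer, MatchResult.Metadata>of(
                        m -> ThreadLocalRandom.current().nextInt(100))
                    .withKeyType(TypeDescriptors.integers()))
            // ... shuffle on that key, capping redistribution at ~100 shards ...
            .apply("FixedShards", Reshuffle.of())
            // ... then drop the keys again before the file read.
            .apply("DropKeys", Values.create());

Reshuffle.viaRandomKey() is the simpler option when you do not need to bound the shard count.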
