Hi André,

As Reza pointed out, this may be due to fusion.

Dataflow autoscaling is THROUGHPUT_BASED, but Dataflow may also fuse steps as an optimization. So what may be happening is that steps 2 and 3 are executed fused, and throughput is calculated based only on the output of step 1. Adding a reshuffle between steps 2 and 3 would prevent fusion, and Dataflow would then see that step 2 produces a large number of items that step 3 can process in parallel.

One way of avoiding this behavior is to make sure your ParDos are 1:1 (one input item produces one output item). I don't know of a programmatic way to tell Dataflow that a ParDo is 1:N, and I don't think there is one, so the usual approach is to prevent fusion using Reshuffle or a side input.

If you are not familiar with fusion: Dataflow may optimize your transformation graph so that a series of ParDos is executed as one large transform. In your case, steps 2 and 3 would be executed together, and the output of step 2 would not be parallelized. I think you can find more information in the Dataflow docs.

On Tue, Dec 31, 2019 at 10:25, André Rocha Silva <[email protected]> wrote:

> Hi!
>
> I have a Cloud Dataflow job that is not scaling.
>
> The job sequence is the following:
> 1 - [io] Read from a file in the bucket (1 element out)
> 2 - [ParDo] With the file information, get a query from a database (10,000
> elements out)
> 3 - [ParDo] Works with the elements
>
> But when I read from a file that already contains the same database query
> result it scales to 60+ workers:
> 1 - [io] Read from a file in the bucket (10,000 elements out)
> 2 - [ParDo] Works with the elements
>
> Do I have to develop an I/O connector for Apache Beam to know how many
> elements it's dealing with?
>
> Best regards
> André Rocha Silva

--
Regards,

Leonardo Alves Miguel
Data Engineer
(16) 3509-5515 | www.arquivei.com.br
