Dataflow partitions work into logical units and relies on autoscaling and dynamic work rebalancing to distribute and redistribute that work across workers. Typically, machine size versus number of machines shouldn't matter unless you run very small or very large jobs; for example, there is little point in running a short-lived job on a machine with 32 cores. Depending on your job, though, things like the amount of RAM per CPU can matter, for instance if your job processes very large elements (such as genome sequences) or buffers a lot of data in memory.
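If you do end up needing a specific worker shape (say, more RAM per vCPU), you can set it when launching the job. Below is a minimal sketch using the Beam Java SDK and the Dataflow runner; the project, bucket, machine type, and worker count are placeholder values, not recommendations.

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerSizingExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);

    // Placeholder project/staging values -- replace with your own.
    options.setProject("my-gcp-project");
    options.setTempLocation("gs://my-bucket/tmp");

    // A high-memory machine type gives more RAM per vCPU, which helps jobs with
    // very large elements or heavy in-memory buffering.
    options.setWorkerMachineType("n1-highmem-4");

    // Cap autoscaling; Dataflow still rebalances work dynamically within that limit.
    options.setMaxNumWorkers(20);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run();
  }
}

For most jobs, though, the defaults are fine and you only need to revisit worker sizing if you see memory pressure or very short-lived work.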
On Thu, Sep 13, 2018 at 6:34 PM [email protected] <[email protected]> wrote:

> Spark has 2 levels of processing:
> a) across different workers,
> b) within the same executor, where multiple cores can work on different partitions.
>
> I know that in Apache Beam with Dataflow as the runner, partitioning is
> abstracted. But does Dataflow use multiple cores to process different
> partitions at the same time?
>
> The objective is to understand what machines should be used to run pipelines.
> Should one give thought to the number of cores on a machine, or does it not matter?
>
> Thanks
> Aniruddh
