Dataflow partitions work into logical units and relies on autoscaling and dynamic work rebalancing to distribute and redistribute that work across workers. Typically, machine size versus number of machines shouldn't matter unless you run very small or very large jobs; for example, there is little point in running a short-lived job on a machine with 32 cores. Depending on your job, though, things like the amount of RAM per CPU can matter, for instance if your job processes very large elements (such as genome sequences) or buffers a lot of data in memory.
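If you do end up needing a specific worker shape (say, more RAM per vCPU), you can set it when launching the job. Below is a minimal sketch using the Beam Java SDK and the Dataflow runner; the project, bucket, machine type, and worker count are placeholder values, not recommendations.

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerSizingExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);

    // Placeholder project/staging values -- replace with your own.
    options.setProject("my-gcp-project");
    options.setTempLocation("gs://my-bucket/tmp");

    // A high-memory machine type gives more RAM per vCPU, which helps jobs with
    // very large elements or heavy in-memory buffering.
    options.setWorkerMachineType("n1-highmem-4");

    // Cap autoscaling; Dataflow still rebalances work dynamically within that limit.
    options.setMaxNumWorkers(20);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run();
  }
}

For most jobs, though, the defaults are fine and you only need to revisit worker sizing if you see memory pressure or very short-lived work.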
On Thu, Sep 13, 2018 at 6:34 PM [email protected] <[email protected]> wrote:

> Spark has 2 levels of processing:
> a) across different workers,
> b) within the same executor, where multiple cores can work on different partitions.
>
> I know that in Apache Beam with Dataflow as the runner, partitioning is
> abstracted. But does Dataflow use multiple cores to process different
> partitions at the same time?
>
> The objective is to understand what machines should be used to run pipelines.
> Should one give thought to the number of cores on a machine, or does it not matter?
>
> Thanks
> Aniruddh
