Hi!

> if I use parallelism of 2 or 4 - it takes the same time.
>
It might be that some of the parallel subtasks receive no data. You can
click on the nodes in the Flink web UI to see whether that is the case for
each subtask, or you can check the metrics of each operator.
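
As a rough illustration of the metrics check, here is a minimal Python
sketch. It assumes the per-operator details have already been fetched as
JSON from the Flink REST API (e.g. /jobs/<job-id>/vertices/<vertex-id>) and
that each subtask entry carries "read-records" and "write-records"
counters; the sample payload and the helper name idle_subtasks are made up
for this example.

```python
from typing import Any, Dict, List


def idle_subtasks(vertex_details: Dict[str, Any]) -> List[int]:
    """Return the indices of subtasks that neither read nor wrote records."""
    idle = []
    for subtask in vertex_details.get("subtasks", []):
        metrics = subtask.get("metrics", {})
        # Assumed metric names; a subtask with both counters at zero
        # did no work, so it did not contribute to throughput.
        read = metrics.get("read-records", 0)
        written = metrics.get("write-records", 0)
        if read == 0 and written == 0:
            idle.append(subtask["subtask"])
    return idle


# Hypothetical payload for one operator running with parallelism 4.
sample = {
    "subtasks": [
        {"subtask": 0, "metrics": {"read-records": 1000, "write-records": 1000}},
        {"subtask": 1, "metrics": {"read-records": 1200, "write-records": 1200}},
        {"subtask": 2, "metrics": {"read-records": 0, "write-records": 0}},
        {"subtask": 3, "metrics": {"read-records": 0, "write-records": 0}},
    ]
}

print(idle_subtasks(sample))
```

If half of the subtasks come back idle, as in this made-up sample, raising
parallelism from 2 to 4 would not be expected to shorten the run.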

> if I don't increase parallelism and just run the job on a fixed number of
> task slots, the job will fail (due to lack of memory on the task manager) or
> it will just take longer to process the data?
>
It depends on many factors, such as the type of source you are using and
the type of operators you are running. Ideally the job will just take
longer, but for some specific operators or connectors it might fail.
This is where users have to tune their jobs.

On Fri, Aug 13, 2021 at 6:48 PM, Gorjan Todorovski <gor...@gmail.com> wrote:

> Hi!
>
> I want to implement a Flink cluster as a native Kubernetes session
> cluster, with the intention of executing Apache Beam jobs that will
> process only batch data, but I am not sure I understand how I would scale
> the cluster if I need to process large datasets.
>
> My understanding is that to be able to process a bigger dataset, you could
> run it with higher parallelism, so the processing will be spread across
> multiple task slots, which might run on multiple nodes.
> But when running Beam jobs, which in my case actually execute TensorFlow
> Extended pipelines, I am not able to control partitioning by key, and I
> don't see any difference in throughput (the time it takes to process a
> specific dataset): if I use parallelism of 2 or 4 - it takes the same
> time.
>
> Also, since the execution is of type "PIPELINED", does this mean I can
> process a dataset of any size? If I don't increase parallelism and just
> run the job on a fixed number of task slots, will the job fail (due to
> lack of memory on the task manager) or will it just take longer to
> process the data?
>
> Thanks,
> Gorjan
>
