Re: Resource Planning

2021-06-17 Thread Robert Metzger
Hi, since your state (150 GB) seems to fit into memory (700 GB), I would recommend trying the HashMapStateBackend: https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/state/state_backends/#the-hashmapstatebackend (unless you know that your state size is going to increase a lot)
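For readers following the suggestion above: in Flink 1.13 the HashMapStateBackend can be selected cluster-wide in flink-conf.yaml. A minimal sketch (the checkpoint path below is hypothetical; the config keys are from the linked docs):

```yaml
# flink-conf.yaml -- select the heap-based state backend (Flink 1.13+)
state.backend: hashmap
# Checkpoints still need a durable location, even with heap state.
state.checkpoints.dir: s3://my-bucket/checkpoints   # hypothetical path
```

The same choice can be made per-job via `env.setStateBackend(new HashMapStateBackend())` in the DataStream API.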

Re: Resource Planning

2021-06-16 Thread Rommel Holmes
Hi, Xintong and Robert, Thanks for the reply. The checkpoint size for our job is 10-20 GB since we are doing incremental checkpointing; if we do a savepoint, it can be as big as 150 GB. 1) We will try to make the Flink instance bigger. 2) Thanks for the pointer, we will take a look. 3) We do have CPU
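The gap between the 10-20 GB checkpoints and the 150 GB savepoint mentioned above is what RocksDB's incremental mode looks like in practice: only changed SST files are uploaded per checkpoint, while a savepoint is always a full self-contained snapshot. A sketch of the relevant configuration (key names per the Flink docs; the HDFS path is hypothetical):

```yaml
# flink-conf.yaml -- RocksDB backend with incremental checkpoints
state.backend: rocksdb
state.backend.incremental: true                  # upload only changed SST files
state.checkpoints.dir: hdfs:///flink/checkpoints # hypothetical path
```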

Re: Resource Planning

2021-06-16 Thread Robert Metzger
Hi Thomas, My gut feeling is that you can use the available resources more efficiently. What's the size of a checkpoint for your job (you can see that from the UI)? Given that your cluster has an aggregate of 64 * 12 = 768 GB of memory available, you might be able to do everything in memory
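Robert's capacity estimate is simple arithmetic; a sketch assuming 64 task slots/nodes with 12 GB each (the figures quoted in his message) and the 150 GB savepoint size reported elsewhere in the thread:

```python
# Back-of-the-envelope check: does the full state fit in aggregate memory?
nodes = 64             # task nodes (or slots), from the thread
mem_per_node_gb = 12   # GB of memory per node, from the thread
state_gb = 150         # full savepoint size reported in the thread

aggregate_gb = nodes * mem_per_node_gb
print(aggregate_gb)               # 768
print(state_gb <= aggregate_gb)   # True: state fits, with headroom left
```

Note that not all of that 768 GB is usable for state: Flink reserves portions for framework, network buffers, and JVM overhead, so the real headroom is smaller than the raw subtraction suggests.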

Re: Resource Planning

2021-06-15 Thread Xintong Song
Hi Thomas, It would be helpful if you can provide the jobmanager/taskmanager logs, and GC logs if possible. Additionally, you may consider monitoring the CPU/memory-related metrics [1] and see if there's anything abnormal when the problem is observed. Thank you~ Xintong Song [1]
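Xintong's suggestion can be wired up through one of Flink's metrics reporters; one possible setup is the built-in JMX reporter (the factory class below ships with Flink 1.13), which lets an external tool such as VisualVM or a JMX exporter read CPU and heap metrics:

```yaml
# flink-conf.yaml -- expose Flink metrics via JMX for external monitoring
metrics.reporter.jmx.factory.class: org.apache.flink.metrics.jmx.JMXReporterFactory
```

GC logs, also requested above, can be enabled by passing the usual JVM flags (e.g. `-Xlog:gc*` on Java 11) through the `env.java.opts` config option.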

Resource Planning

2021-06-15 Thread Thomas Wang
Hi, I'm trying to see if we have given enough resources (i.e. CPU and memory) to each task node to perform a deduplication job. Currently, the job is not running very stably. What I have been observing is that after a couple of days of running, we will suddenly see backpressure happen on one