Hi,

I am wondering how this can work properly at all if you are using a local fs 
for checkpoints instead of a distributed fs. First, what happens under node 
failures: if the SSD becomes unavailable, if a task gets rescheduled to a 
different machine and can no longer access the disk with the corresponding 
state data, or if you want to scale out? Second, you can observe the same 
problem with the job manager: how could the checkpoint coordinator, which runs 
on the JM, access a file on a local fs on a different node to clean up the 
checkpoint data? The whole point of using a distributed fs here is that all 
TMs and the JM can access the checkpoint files.
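
If the goal was to avoid the HDFS overhead on the hot path, the usual setup is 
to keep the local SSDs for the RocksDB working directory only, and to point the 
checkpoint and savepoint directories at a distributed fs. A minimal sketch of 
the relevant flink-conf.yaml entries (assuming the RocksDB state backend; the 
paths are placeholders):

  # live RocksDB instances stay on the fast local SSDs of each TM
  state.backend: rocksdb
  state.backend.rocksdb.localdir: /mnt/ssd/flink/rocksdb

  # completed checkpoints/savepoints must be reachable by all TMs and the JM
  state.checkpoints.dir: hdfs:///flink/checkpoints
  state.savepoints.dir: hdfs:///flink/savepoints

That way each TM still gets SSD-speed access to its working state, while the 
checkpoint files land where the checkpoint coordinator can read and delete 
them.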

Best,
Stefan

> Am 22.07.2018 um 19:03 schrieb Ashish Pokharel <ashish...@yahoo.com>:
> 
> All,
> 
> We recently moved our checkpoint directory from HDFS to local SSDs mounted on 
> the data nodes (we were starting to see performance impacts on checkpoints 
> etc. as more and more complex ML apps were spinning up in YARN). This worked 
> great except that when jobs are canceled, or canceled with a savepoint, the 
> local data is not being cleaned up. In HDFS, checkpoint directories were 
> cleaned up on Cancel and Cancel with Savepoint, as far as I can remember. I am 
> wondering if it is a permissions issue. The local disks have RWX permissions 
> for both the yarn and flink headless users (the flink headless user submits 
> the apps to YARN via our CI/CD pipeline). 
> 
> Appreciate any pointers on this.
> 
> Thanks, Ashish