Sorry, just a follow-up. In the absence of a NAS, is the best option here to put both checkpoints and savepoints on HDFS and have the state backend use local SSDs? We were trying to avoid hitting HDFS for anything other than savepoints.
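For reference, this is roughly the flink-conf.yaml we had in mind - a minimal sketch only, with key names taken from the Flink docs; the paths and the ZooKeeper quorum below are just placeholders:

  # RocksDB state backend, keeping its working/instance data on the local SSDs
  state.backend: rocksdb
  state.backend.rocksdb.localdir: /mnt/ssd1/flink/rocksdb   # placeholder local SSD path

  # Durable locations that both the JM and all TMs can reach
  state.checkpoints.dir: hdfs:///flink/checkpoints
  state.savepoints.dir: hdfs:///flink/savepoints

  # Standalone HA with ZooKeeper also needs a shared storage dir
  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181   # placeholder quorum
  high-availability.storageDir: hdfs:///flink/ha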
- Ashish

On Monday, July 23, 2018, 7:45 AM, ashish pok <ashish...@yahoo.com> wrote:

Stefan,

I did have the first point in the back of my mind. I was under the impression, though, that for checkpoints the cleanup would be done by the TMs, since the checkpoints are taken by the TMs. So for a standalone cluster with its own ZooKeeper for JM high availability, a NAS is a must-have? We were going to go with local checkpoints plus access to remote HDFS for savepoints. It sounds like that would be a bad idea then. Unfortunately we can't run on YARN, and a NAS is also a no-no in one of our datacenters - there is a mountain of security compliance to climb before we would get it into Production if we need to go that route.

Thanks, Ashish

On Monday, July 23, 2018, 5:10 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

Hi,

I am wondering how this can even work properly if you are using a local fs for checkpoints instead of a distributed fs. First, what happens under node failures, if the SSD becomes unavailable, if a task gets scheduled to a different machine and can no longer access the disk with the corresponding state data, or if you want to scale out? Second, the same problem is what you can observe with the job manager: how could the checkpoint coordinator, which runs on the JM, access a file on a local FS of a different node to clean up the checkpoint data? The purpose of using a distributed fs here is that all TMs and the JM can access the checkpoint files.

Best,
Stefan

> On 22.07.2018 at 19:03, Ashish Pokharel <ashish...@yahoo.com> wrote:
>
> All,
>
> We recently moved our checkpoint directory from HDFS to local SSDs mounted on the Data Nodes (we were starting to see performance impacts on checkpoints etc. as more and more complex ML apps were spinning up in YARN). This worked great, except that when jobs are canceled, or canceled with savepoint, the local data is not being cleaned up. On HDFS, checkpoint directories were cleaned up on Cancel and Cancel with Savepoint, as far as I can remember. I am wondering if it is a permissions issue. The local disks have RWX permissions for both the yarn and flink headless users (the flink headless user submits the apps to YARN through our CICD pipeline).
>
> Appreciate any pointers on this.
>
> Thanks, Ashish