Sorry, just a follow-up. In the absence of a NAS, is the best option then to put 
both checkpoints and savepoints on HDFS and have the StateBackend use the local 
SSDs? We were trying to avoid hitting HDFS for anything other than savepoints.
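For concreteness, roughly the layout we are asking about - just a sketch assuming 
the RocksDB state backend; the namenode host/port, the SSD mount points, and the 
checkpoint interval below are placeholders, not our actual settings:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointLayoutSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint data goes to HDFS so every TM and the JM can reach it;
        // "true" turns on incremental checkpoints. Savepoints can also target
        // HDFS via state.savepoints.dir or an explicit path when triggering.
        RocksDBStateBackend backend =
            new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints", true);

        // RocksDB's working files (the live state) stay on the local SSDs.
        backend.setDbStoragePaths("/mnt/ssd1/flink/rocksdb", "/mnt/ssd2/flink/rocksdb");

        env.setStateBackend(backend);
        env.enableCheckpointing(60_000); // checkpoint every 60s

        // ... job definition ...
        env.execute("checkpoint-layout-sketch");
    }
}

That way only checkpoint/savepoint data would ever touch HDFS, while the hot 
RocksDB files stay on the local disks.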


- Ashish

On Monday, July 23, 2018, 7:45 AM, ashish pok <ashish...@yahoo.com> wrote:

Stefan,
I did have your first point in the back of my mind. I was under the impression, 
though, that for checkpoints the cleanup would be done by the TMs, since the TMs 
are the ones taking them.
So for a standalone cluster with its own ZooKeeper for JM high availability, a 
NAS is a must-have? We were going to go with local checkpoints plus access to a 
remote HDFS for savepoints. That sounds like it will be a bad idea then. 
Unfortunately we can't run on YARN, and a NAS is also a no-no in one of our 
datacenters - there is a mountain of security compliance to climb before we would 
be allowed to use one in production, if we need to go that route.
Thanks, Ashish


On Monday, July 23, 2018, 5:10 AM, Stefan Richter <s.rich...@data-artisans.com> 
wrote:

Hi,

I am wondering how this can even work properly if you are using a local fs for 
checkpoints instead of a distributed fs. First, what happens under node failures, 
if the SSD becomes unavailable, if a task gets scheduled to a different machine 
and can no longer access the disk with the corresponding state data, or if you 
want to scale out? Second, the same problem shows up with the job manager: how 
could the checkpoint coordinator, which runs on the JM, access a file on a local 
FS on a different node to clean up the checkpoint data? The purpose of using a 
distributed fs here is that all TMs and the JM can access the checkpoint files.

Best,
Stefan

> On 22.07.2018 at 19:03, Ashish Pokharel <ashish...@yahoo.com> wrote:
> 
> All,
> 
> We recently moved our checkpoint directory from HDFS to local SSDs mounted on 
> the Data Nodes (we were starting to see performance impacts on checkpoints etc. 
> as more and more complex ML apps were spinning up in YARN). This worked great 
> other than the fact that when jobs are canceled, or canceled with a savepoint, 
> the local data is not being cleaned up. In HDFS, checkpoint directories were 
> cleaned up on Cancel and Cancel with Savepoint as far as I can remember. I am 
> wondering if it is a permissions issue. The local disks have RWX permissions 
> for both the yarn and flink headless users (the flink headless user submits 
> the apps to YARN through our CI/CD pipeline). 
> 
> Appreciate any pointers on this.
> 
> Thanks, Ashish





