Hello Flink community! I'm having some trouble running Flink on YARN. We have servers with many SSD data disks (16 per server). When running jobs on YARN, Flink configures the temporary directories, for example:
2025-06-05 17:13:54,133 INFO org.apache.flink.runtime.clusterframework.BootstrapTools [] - Setting directories for temporary files to: /data1/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data2/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data3/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data4/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data5/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data6/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data7/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data8/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data9/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data10/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data11/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data12/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data13/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data14/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data15/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549, /data16/hadoop-yarn/nm-local/usercache/flink/appcache/application_1746738689296_320549

But for RocksDB, Flink uses only one random disk (folder). This behaviour has two drawbacks:

1. A single disk fills up very quickly (we have heavy streaming jobs with large state).
2. That one disk is overloaded while the other 15 sit idle.

I cannot find any option to change this behaviour, other than setting state.backend.rocksdb.localdir manually (a rough sketch of what I mean is in the P.S. below). But in that case, when Flink is restarted or killed, garbage is left on the disks that I have to clean up by hand. With many clusters running on YARN it is very hard to tell which directories are still alive and required and which are garbage.

Is there any good solution?

--
With best regards,
Artem Syrovatskiy
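P.S. For illustration, this is roughly what I mean by configuring the RocksDB local directories by hand. It is just a sketch against the DataStream API; the /dataN/flink-rocksdb paths are placeholders matching our disk layout (Flink does not create or manage them), and as far as I understand Flink only picks one of the listed directories per RocksDB instance, so different operators/subtasks land on different disks:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDbMultiDiskJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend();
            // Spread RocksDB working files over several data disks; Flink chooses
            // among these directories for each RocksDB instance instead of putting
            // everything on one disk.
            backend.setDbStoragePaths(
                    "/data1/flink-rocksdb",
                    "/data2/flink-rocksdb",
                    // ... one entry per data disk ...
                    "/data16/flink-rocksdb");
            env.setStateBackend(backend);

            // job topology goes here ...
            env.execute("rocksdb-multi-disk-example");
        }
    }

The same can be done with state.backend.rocksdb.localdir in flink-conf.yaml using a comma-separated list of directories. Either way, these directories live outside the YARN application cache, so they are not cleaned up automatically when the container is killed, which is exactly the garbage problem I described above.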