Hi, after enabling Flink’s HistoryServer, we observed that the way a running job is stopped affects whether its history shows up in the HistoryServer:
If we use cancel or stop with savepoint, the HistoryServer can in most cases display the job’s basic information (for example checkpoint status, exceptions, and the DAG). But if we kill the job via YARN kill, the job’s history never appears in the HistoryServer.

This difference causes us trouble: we expect to obtain historical job information reliably, but at present no shutdown method guarantees that a job’s history will appear in the HistoryServer. I roughly understand the underlying mechanism, but I don’t know why the community designed it this way. The uncertainty complicates the job operations and maintenance platform that we run on top of Flink.
