Hmm, this does indeed look odd. Looping in Till (cc), who might know more about this.

On 20.06.2018 16:43, Dominik Wosiński wrote:
Hello,

I'm not sure whether the problem is caused by a bad configuration or by an inconsistency in the documentation, but according to this document: https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture

"If a job fails, all non-HA files' refCounts are reset to 0; all HA files' refCounts remain and will not be increased again on recovery."

But in the JobManager's code, if the job status changes to failed and the JobManager receives the message reporting this, it sends a RemoveJob message to itself. That message invokes the removeJob() function, which always invokes the following functions:
libraryCacheManager.unregisterJob(jobID)
blobServer.cleanupJob(jobID, removeJobFromStateBackend)
jobManagerMetricGroup.removeJob(jobID)
As far as I understand, this removes the blob entries immediately. According to the doc, it should only freeze the refCounts of HA files and reset the refCounts of non-HA files to 0 so that they can be removed later.
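To make the difference concrete, here is a toy sketch of the two behaviours as I understand them. It is not Flink code: the class BlobRefCountSketch and all of its methods are hypothetical names made up for illustration, and the "observed" variant only reflects my reading of removeJob(), which may be wrong.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model, NOT Flink's API: it only illustrates the two
// behaviours being compared in this mail.
public class BlobRefCountSketch {
    // blob key -> reference count, kept separately for non-HA and HA files
    private final Map<String, Integer> nonHaRefCounts = new HashMap<>();
    private final Map<String, Integer> haRefCounts = new HashMap<>();

    // Register one reference to a blob, either as an HA or a non-HA file.
    public void register(String key, boolean highAvailability) {
        Map<String, Integer> target = highAvailability ? haRefCounts : nonHaRefCounts;
        target.merge(key, 1, Integer::sum);
    }

    // Behaviour as the FLIP-19 text describes it: non-HA counts are reset
    // to 0 (so the files can be cleaned up later), HA counts stay untouched.
    public void onJobFailedPerDoc() {
        nonHaRefCounts.replaceAll((key, count) -> 0);
    }

    // Behaviour as I read it from removeJob() (an assumption): the entries
    // are dropped right away instead of being kept with a frozen/zero count.
    public void onJobFailedAsObserved() {
        nonHaRefCounts.clear();
        haRefCounts.clear();
    }

    // Returns the current count, or null if the blob is no longer tracked.
    public Integer refCount(String key) {
        Integer count = nonHaRefCounts.get(key);
        return count != null ? count : haRefCounts.get(key);
    }
}
```

With the first variant a failed job still has its HA entries with their old counts; with the second variant nothing is left to recover from.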
Is the doc right, and have I missed something here?
Thanks in advance.

