hmm, this indeed looks odd. Looping in Till (cc) who might know more
about this.
On 20.06.2018 16:43, Dominik Wosiński wrote:
Hello,
I'm not sure whether the problem is connected with bad configuration
or it's some inconsistency in the documentation but according to this
document:https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture.
*I*f a job fails, all|non-HA|files' refCounts are reset to 0;
all|HA|*files' refCounts remain and will not be increased again on
recovery. *But in the JobManager's code if the Job Status is changed
to failed and the JobManager receive the message with that fact, it
will send /RemoveJob/ message to itself, which invokes /removeJob()
/function that always invokes following functions :
libraryCacheManager.unregisterJob(jobID)
blobServer.cleanupJob(jobID, removeJobFromStateBackend)
jobManagerMetricGroup.removeJob(jobID)
As far as I understand this removes blob entries immediately. And
according to the doc it should only freeze refCounts for HA files and
reset refCounts for non-Ha files to allow their later removal.
Is the doc right and I have missed something here ?
Thanks in Advance.