Re: Blob Server Removes Failed Jobs Immediately

Chesnay Schepler Wed, 20 Jun 2018 10:11:29 -0700

Hmm, this indeed looks odd. Looping in Till (cc), who might know more about this.


On 20.06.2018 16:43, Dominik Wosiński wrote:

Hello,
I'm not sure whether the problem is connected with a bad configuration or is an inconsistency in the documentation, but according to this document:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture

"If a job fails, all non-HA files' refCounts are reset to 0; all HA files' refCounts remain and will not be increased again on recovery."

But in the JobManager's code, if the job status changes to failed and the JobManager receives the message reporting that, it sends a RemoveJob message to itself. That message invokes the removeJob() function, which always calls the following functions:
libraryCacheManager.unregisterJob(jobID)
blobServer.cleanupJob(jobID, removeJobFromStateBackend)

jobManagerMetricGroup.removeJob(jobID)
As far as I understand, this removes the blob entries immediately, whereas according to the doc it should only freeze the refCounts for HA files and reset the refCounts for non-HA files to allow their later removal.
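
To make the difference concrete, here is a minimal toy sketch of the two behaviours. It is not Flink's BlobServer: the class BlobRefCountSketch and all of its methods are hypothetical names made up for illustration, and the meaning of the boolean flag (whether HA files are removed too) is my assumption based on the parameter name removeJobFromStateBackend.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only, NOT Flink code. All names here are hypothetical.
class BlobRefCountSketch {
    // Reference count per blob, plus a marker for whether the blob is an HA file.
    private final Map<String, Integer> refCounts = new HashMap<>();
    private final Map<String, Boolean> isHa = new HashMap<>();

    void register(String blob, boolean ha) {
        refCounts.merge(blob, 1, Integer::sum);
        isHa.put(blob, ha);
    }

    // What the FLIP-19 text describes: non-HA refCounts are reset to 0,
    // HA refCounts stay as they are, and nothing is deleted at this point.
    void onJobFailedPerDoc() {
        for (Map.Entry<String, Integer> e : refCounts.entrySet()) {
            if (!isHa.get(e.getKey())) {
                e.setValue(0);
            }
        }
    }

    // What removeJob() appears to do: entries are dropped right away.
    // Assumption: HA files are dropped only when cleanupHa is true.
    void cleanupJob(boolean cleanupHa) {
        refCounts.keySet().removeIf(blob -> cleanupHa || !isHa.get(blob));
        isHa.keySet().retainAll(refCounts.keySet());
    }

    boolean contains(String blob) {
        return refCounts.containsKey(blob);
    }

    int refCount(String blob) {
        return refCounts.getOrDefault(blob, 0);
    }
}
```

In this model, onJobFailedPerDoc() leaves every entry in place for a later cleanup pass, while cleanupJob(false) removes the non-HA entries at once.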
Is the doc right, and have I missed something here?
Thanks in advance.
