Re: Blob Server Removes Failed Jobs Immediately
Hi Dominik, all job related files (non-HA as well as HA) are removed once the job reaches a globally terminal state (FINISHED, CANCELLED, FAILED). This is the case because Flink assumes that the job is done and won't be retried afterwards. Thus, the documentation in the Flip is not true and should be corrected. Cheers, Till On Wed, Jun 20, 2018 at 7:11 PM Chesnay Schepler wrote: > hmm, this indeed looks odd. Looping in Till (cc) who might know more about > this. > > On 20.06.2018 16:43, Dominik Wosiński wrote: > > Hello, > > I'm not sure whether the problem is connected with bad configuration or > it's some inconsistency in the documentation but according to this document: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture > . *I*f a job fails, all non-HA files' refCounts are reset to 0; all HA *files' > refCounts remain and will not be increased again on recovery. *But in the > JobManager's code if the Job Status is changed to failed and the JobManager > receive the message with that fact, it will send *RemoveJob* message to > itself, which invokes *removeJob() *function that always invokes > following functions : > > libraryCacheManager.unregisterJob(jobID) > blobServer.cleanupJob(jobID, removeJobFromStateBackend) > > jobManagerMetricGroup.removeJob(jobID) > > As far as I understand this removes blob entries immediately. And > according to the doc it should only freeze refCounts for HA files and reset > refCounts for non-Ha files to allow their later removal. > Is the doc right and I have missed something here ? > Thanks in Advance. > > >
Re: Blob Server Removes Failed Jobs Immediately
hmm, this indeed looks odd. Looping in Till (cc) who might know more about this. On 20.06.2018 16:43, Dominik Wosiński wrote: Hello, I'm not sure whether the problem is connected with bad configuration or it's some inconsistency in the documentation but according to this document:https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture. *I*f a job fails, all|non-HA|files' refCounts are reset to 0; all|HA|*files' refCounts remain and will not be increased again on recovery. *But in the JobManager's code if the Job Status is changed to failed and the JobManager receive the message with that fact, it will send /RemoveJob/ message to itself, which invokes /removeJob() /function that always invokes following functions : libraryCacheManager.unregisterJob(jobID) blobServer.cleanupJob(jobID, removeJobFromStateBackend) jobManagerMetricGroup.removeJob(jobID) As far as I understand this removes blob entries immediately. And according to the doc it should only freeze refCounts for HA files and reset refCounts for non-Ha files to allow their later removal. Is the doc right and I have missed something here ? Thanks in Advance.
Blob Server Removes Failed Jobs Immediately
Hello, I'm not sure whether the problem is connected with bad configuration or it's some inconsistency in the documentation but according to this document: https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture . *I*f a job fails, all non-HA files' refCounts are reset to 0; all HA *files' refCounts remain and will not be increased again on recovery. *But in the JobManager's code if the Job Status is changed to failed and the JobManager receive the message with that fact, it will send *RemoveJob* message to itself, which invokes *removeJob() *function that always invokes following functions : libraryCacheManager.unregisterJob(jobID) blobServer.cleanupJob(jobID, removeJobFromStateBackend) jobManagerMetricGroup.removeJob(jobID) As far as I understand this removes blob entries immediately. And according to the doc it should only freeze refCounts for HA files and reset refCounts for non-Ha files to allow their later removal. Is the doc right and I have missed something here ? Thanks in Advance.