Re: Blob Server Removes Failed Jobs Immediately

2018-06-21 Thread Till Rohrmann
Hi Dominik,

all job related files (non-HA as well as HA) are removed once the job
reaches a globally terminal state (FINISHED, CANCELLED, FAILED). This is
the case because Flink assumes that the job is done and won't be retried
afterwards. Thus, the documentation in the Flip is not true and should be
corrected.

Cheers,
Till

On Wed, Jun 20, 2018 at 7:11 PM Chesnay Schepler  wrote:

> hmm, this indeed looks odd. Looping in Till (cc) who might know more about
> this.
>
> On 20.06.2018 16:43, Dominik Wosiński wrote:
>
> Hello,
>
> I'm not sure whether the problem is connected with bad configuration or
> it's some inconsistency in the documentation but according to this document:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture
> . *I*f a job fails, all non-HA files' refCounts are reset to 0; all HA *files'
> refCounts remain and will not be increased again on recovery. *But in the
> JobManager's code if the Job Status is changed to failed and the JobManager
> receive the message with that fact, it will send *RemoveJob* message to
> itself, which invokes *removeJob() *function that always invokes
> following functions :
>
> libraryCacheManager.unregisterJob(jobID)
> blobServer.cleanupJob(jobID, removeJobFromStateBackend)
>
> jobManagerMetricGroup.removeJob(jobID)
>
> As far as I understand this removes blob entries immediately. And
> according to the doc it should only freeze refCounts for HA files and reset
> refCounts for non-Ha files to allow their later removal.
> Is the doc right and I have missed something here ?
> Thanks in Advance.
>
>
>


Re: Blob Server Removes Failed Jobs Immediately

2018-06-20 Thread Chesnay Schepler
hmm, this indeed looks odd. Looping in Till (cc) who might know more 
about this.


On 20.06.2018 16:43, Dominik Wosiński wrote:

Hello,

I'm not sure whether the problem is connected with bad configuration 
or it's some inconsistency in the documentation but according to this 
document:https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture. 
*I*f a job fails, all|non-HA|files' refCounts are reset to 0; 
all|HA|*files' refCounts remain and will not be increased again on 
recovery. *But in the JobManager's code if the Job Status is changed 
to failed and the JobManager receive the message with that fact, it 
will send /RemoveJob/ message to itself, which invokes /removeJob() 
/function that always invokes following functions :

libraryCacheManager.unregisterJob(jobID)
blobServer.cleanupJob(jobID, removeJobFromStateBackend)

jobManagerMetricGroup.removeJob(jobID)
As far as I understand this removes blob entries immediately. And 
according to the doc it should only freeze refCounts for HA files and 
reset refCounts for non-Ha files to allow their later removal.

Is the doc right and I have missed something here ?
Thanks in Advance.





Blob Server Removes Failed Jobs Immediately

2018-06-20 Thread Dominik Wosiński
Hello,

I'm not sure whether the problem is connected with bad configuration or
it's some inconsistency in the documentation but according to this document:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture
. *I*f a job fails, all non-HA files' refCounts are reset to 0; all HA *files'
refCounts remain and will not be increased again on recovery. *But in the
JobManager's code if the Job Status is changed to failed and the JobManager
receive the message with that fact, it will send *RemoveJob* message to
itself, which invokes *removeJob() *function that always invokes following
functions :

libraryCacheManager.unregisterJob(jobID)
blobServer.cleanupJob(jobID, removeJobFromStateBackend)

jobManagerMetricGroup.removeJob(jobID)

As far as I understand this removes blob entries immediately. And according
to the doc it should only freeze refCounts for HA files and reset refCounts
for non-Ha files to allow their later removal.
Is the doc right and I have missed something here ?
Thanks in Advance.