I think that process killing (the TERM signal) is a very typical way in
Linux to shut down processes. It is the most robust way, since it does not
require sending any custom messages to the process.

This is sort of graceful, as the JVM gets the signal and may do a lot of
things before shutting down, such as running shutdown hooks. The ungraceful
variant is the KILL signal, which terminates the process immediately
without giving it any chance to clean up.
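
For context: the JVM runs registered shutdown hooks when it receives TERM,
but not on KILL, which bypasses the JVM entirely. A minimal sketch of such
a hook (the cleanup body is purely illustrative, not Flink code):

    // Minimal sketch: a hook the JVM runs on normal exit and on SIGTERM,
    // but NOT on SIGKILL (kill -9), which bypasses the JVM entirely.
    public class ShutdownHookExample {
        public static void main(String[] args) throws InterruptedException {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                System.out.println("Cleaning up before exit...");
                // e.g. close connections, delete temporary files
            }));
            Thread.sleep(60_000); // send SIGTERM to this process to test
        }
    }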



On Thu, Feb 5, 2015 at 4:16 PM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hmm, this is not a very gentlemanly way to terminate the Job/TaskManagers.
> I'll check how the ActorSystem behaves when the process is killed.
>
> Why can't we implement a more graceful termination mechanism? For example,
> we could send a termination message to the JobManager and TaskManagers.
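>
> For illustration, such a termination message could look roughly like this
> with Akka's Java API (the Terminate class and the actor shown here are
> hypothetical sketches, not existing Flink classes):
>
>     import akka.actor.UntypedActor;
>
>     // Hypothetical marker message asking an actor to clean up and stop.
>     final class Terminate implements java.io.Serializable {}
>
>     public class GracefulActor extends UntypedActor {
>         @Override
>         public void onReceive(Object message) {
>             if (message instanceof Terminate) {
>                 // release resources (e.g. temporary blob files) here
>                 getContext().stop(getSelf()); // stop the actor gracefully
>             } else {
>                 unhandled(message);
>             }
>         }
>     }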
>
> On Thu, Feb 5, 2015 at 4:10 PM, Ufuk Celebi <u...@apache.org> wrote:
>
>> Thank you very much, Robert!
>>
>> The problem is that the job/task manager shutdown methods are never
>> called: the scripts kill the task/job manager processes, so the shutdown
>> methods never run.
>>
>> @Till: Do you know whether there is a mechanism in Akka to register the
>> actors for JVM shutdown hooks? I tried to register a shutdown hook via
>> Runtime.getRuntime().addShutdownHook(), but I didn't manage to get a
>> reference to the task manager.
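>>
>> The pattern I tried looked roughly like this (the sketch assumes a
>> reference to the ActorSystem is available where the hook is registered,
>> which is exactly the part I couldn't solve):
>>
>>     import akka.actor.ActorSystem;
>>
>>     public class ShutdownHookSketch {
>>         public static void main(String[] args) {
>>             final ActorSystem actorSystem = ActorSystem.create("flink");
>>             Runtime.getRuntime().addShutdownHook(new Thread() {
>>                 @Override
>>                 public void run() {
>>                     actorSystem.shutdown();         // Akka 2.3-era API
>>                     actorSystem.awaitTermination(); // wait for actors
>>                 }
>>             });
>>         }
>>     }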
>>
>>
>> On Thu, Feb 5, 2015 at 3:29 PM, Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Robert,
>>>
>>> thanks for the info. If the TaskManager/JobManager does not shut down
>>> properly, e.g. because the process is killed, then the BlobManager
>>> indeed cannot remove all stored files. I don't know whether this was
>>> the case for you lately. Furthermore, the files are not deleted
>>> directly after the job has finished. Internally there is a cleanup
>>> task which is triggered every hour and deletes all blobs which are no
>>> longer referenced.
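>>>
>>> (Purely for illustration, an hourly cleanup task of that kind could be
>>> scheduled like this; deleteUnreferencedBlobs() is a hypothetical
>>> method, this is not the actual Flink code:)
>>>
>>>     import java.util.concurrent.Executors;
>>>     import java.util.concurrent.ScheduledExecutorService;
>>>     import java.util.concurrent.TimeUnit;
>>>
>>>     // Run the cleanup once per hour on a single background thread.
>>>     ScheduledExecutorService cleaner =
>>>         Executors.newSingleThreadScheduledExecutor();
>>>     cleaner.scheduleAtFixedRate(
>>>         () -> deleteUnreferencedBlobs(), // hypothetical cleanup method
>>>         1, 1, TimeUnit.HOURS);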
>>>
>>> But we definitely have to look into it to see how we could improve this
>>> behaviour.
>>>
>>> Greets,
>>>
>>> Till
>>>
>>> On Thu, Feb 5, 2015 at 3:21 PM, Robert Waury <
>>> robert.wa...@googlemail.com> wrote:
>>>
>>>> I talked with the admins. The problem seemed to have been that the disk
>>>> was full and Flink couldn't create the directory.
>>>>
>>>> Maybe the error message should indicate when that is the cause.
>>>>
>>>> While cleaning up the disk, we noticed that a lot of temporary
>>>> blobStore files had not been deleted by Flink after the job finished.
>>>> This seems to have caused, or at least worsened, the problem.
>>>>
>>>> Cheers,
>>>> Robert
>>>>
>>>> On Thu, Feb 5, 2015 at 1:14 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>>
>>>>> On Thu, Feb 5, 2015 at 11:23 AM, Robert Waury <
>>>>> robert.wa...@googlemail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I can reproduce the error on my cluster.
>>>>>>
>>>>>> Unfortunately I can't check whether the parent directories were
>>>>>> created on the different nodes since I have no way of accessing them. I
>>>>>> start all the jobs from a gateway.
>>>>>>
>>>>>
>>>>> I've added a check to the directory creation (in branches release-0.8
>>>>> and master), which should fail with a proper error message if that is
>>>>> the problem. If you have time to (re)deploy Flink, it would be great
>>>>> to know if that indeed is the issue. Otherwise, we need to further
>>>>> investigate this.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
