In our case the application has no checkpointing capabilities. For us,
a reschedule just means running from the start on a new host.

So a checkpoint with a signal 9 should be enough?


On 4 April 2012 17:50, Reuti <[email protected]> wrote:
> On 04.04.2012 at 17:42, Lars van der bijl wrote:
>
>> Hey Reuti
>>
>> On 4 April 2012 17:14, Reuti <[email protected]> wrote:
>>> Well, in both cases it is killed of course. You could set loglevel to 
>>> log_info and search the messages file of the qmaster for entries like:
>>>
>>> 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 
>>> rescheduling because: manual/auto rescheduling
>>> 04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
>>> 04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
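A minimal sketch of how such a search could be scripted, assuming the qmaster messages file uses the pipe-separated layout of the three sample lines above (timestamp, thread, host, level, message). The function names classify_line and events_for_job are hypothetical, and the two patterns are modelled only on the wording quoted here; other SGE versions may phrase these messages differently.

```python
import re

# Patterns modelled on the sample lines quoted in this thread (assumption:
# the wording is stable across the messages file being searched).
RESCHEDULED = re.compile(r"rescheduling job (\d+)\.(\d+)")
DELETED = re.compile(r"(\S+) has deleted job (\d+)")


def classify_line(line):
    """Return ("rescheduled" | "deleted", job_id) or None for other lines."""
    # Layout assumed from the samples: date time|thread|host|level|message
    fields = line.rstrip("\n").split("|", 4)
    if len(fields) != 5:
        return None
    message = fields[4]
    m = RESCHEDULED.match(message)
    if m:
        return ("rescheduled", m.group(1))
    m = DELETED.match(message)
    if m:
        return ("deleted", m.group(2))
    return None


def events_for_job(lines, job_id):
    """Collect the reschedule/delete events logged for one job id."""
    events = []
    for line in lines:
        result = classify_line(line)
        if result is not None and result[1] == job_id:
            events.append(result[0])
    return events
```

Reading the file line by line, for example `events_for_job(open(path), "3963")` with `path` pointing at the qmaster messages file, avoids loading a large log into memory at once.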
>>
>> Might have to rotate the file before I try something like that;
>> it's currently 117 MB.
>>
>>>
>>> Then you can act on this. Do you often need to reschedule a job? I wonder 
>>> whether using a checkpointing environment would help (even if we don't 
>>> intend to use any checkpointing at all). There you can have a procedure 
>>> for migration in migr_command.
>>
>> No, it's not something I want to happen often, but it happens. One thing
>> I'm still struggling with, on a related note, is that a task will keep
>> running even after it is rescheduled, making both of the outputs
>> useless.
>>
>> Would we be able to make sure the task is kill -9'd (and its sub
>
> The default behavior in SGE is:
>
> # kill -9 -- -pid
>
> The negative PID makes this kill the complete process group. The problem 
> of surviving child processes should have been fixed as of 6.2u3, as I 
> found recently, but sometimes it still occurs.
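The same process-group kill can be sketched in Python, for instance inside a site-specific cleanup script. This is only an illustration of the mechanism, not what SGE itself runs; kill_process_group is a hypothetical helper name. One general POSIX caveat: a child that has moved itself into a different process group or session (via setpgid or setsid) is not reached by a group kill, which is one possible way for processes to survive.

```python
import os
import signal
import subprocess


def kill_process_group(pid):
    """Send SIGKILL to every process in the group that `pid` belongs to."""
    pgid = os.getpgid(pid)
    # Equivalent in effect to the shell command: kill -9 -- -<pgid>
    os.killpg(pgid, signal.SIGKILL)
    return pgid


if __name__ == "__main__":
    # start_new_session=True gives the child its own session and process
    # group, so the group kill does not hit this script as well.
    proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
    kill_process_group(proc.pid)
    proc.wait()
    # A negative return code means the child was terminated by that signal.
    print(proc.returncode)
```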
>
>
>> pids) if it's rescheduled using checkpointing?
>
> In fact, you have to do it yourself. SGE will start the migr_command, and 
> you have to checkpoint by whatever means and then kill all processes yourself. 
> You can have a look at my Howto:
>
> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
>
> and example5 therein. Rescheduling a job would then mean suspending it from 
> the command line, which will start the migr_command.
>
> -- Reuti
>
>
>>> -- Reuti
>>>
>>>
>>> On 04.04.2012 at 16:33, Lars van der bijl wrote:
>>>
>>>> Is there a way to tell the difference?
>>>>
>>>> If I reschedule a job, I get these values in the usage file in the epilog:
>>>>
>>>> wait_status=3727362
>>>> exit_status=137
>>>> signal=9
>>>> start_time=1333549517
>>>> end_time=1333549565
>>>> ru_wallclock=48
>>>> ru_utime=0.226965
>>>> ru_stime=0.306953
>>>> ru_maxrss=5408
>>>> ru_ixrss=0
>>>> ru_idrss=0
>>>> ru_isrss=0
>>>> ru_minflt=40792
>>>> ru_majflt=5
>>>> ru_nswap=0
>>>> ru_inblock=7992
>>>> ru_oublock=232
>>>> ru_msgsnd=0
>>>> ru_msgrcv=0
>>>> ru_nsignals=0
>>>> ru_nvcsw=3489
>>>> ru_nivcsw=113
>>>>
>>>> If I kill the job, I get this:
>>>>
>>>> wait_status=3727362
>>>> exit_status=137
>>>> signal=9
>>>> start_time=1333549704
>>>> end_time=1333549719
>>>> ru_wallclock=15
>>>> ru_utime=0.196970
>>>> ru_stime=0.196970
>>>> ru_maxrss=5412
>>>> ru_ixrss=0
>>>> ru_idrss=0
>>>> ru_isrss=0
>>>> ru_minflt=40459
>>>> ru_majflt=0
>>>> ru_nswap=0
>>>> ru_inblock=0
>>>> ru_oublock=232
>>>> ru_msgsnd=0
>>>> ru_msgrcv=0
>>>> ru_nsignals=0
>>>> ru_nvcsw=705
>>>> ru_nivcsw=149
>>>>
>>>> Does anyone know of a way to tell the difference from the epilog?
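To make the comparison concrete, here is a small sketch that parses the key=value usage listing and checks the status fields. The names parse_usage, killed_by_signal and status_fields are hypothetical, and how the epilog locates the usage file is left out. The only outside fact relied on is the common convention that a process killed by signal N is reported with exit status 128 + N, which matches exit_status=137 and signal=9 in both listings.

```python
def parse_usage(text):
    """Parse 'key=value' lines into a dict of strings."""
    usage = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines without an '=' separator
            usage[key] = value
    return usage


def killed_by_signal(usage):
    """Return the signal number if exit_status equals 128 + signal, else None."""
    exit_status = int(usage.get("exit_status", 0))
    sig = int(usage.get("signal", 0))
    if sig and exit_status == 128 + sig:
        return sig
    return None


def status_fields(usage):
    """The fields that describe how the job ended, as opposed to resource use."""
    return tuple(usage.get(k) for k in ("wait_status", "exit_status", "signal"))
```

Applied to the two listings above, status_fields gives the same tuple for the rescheduled and the killed job, so these fields alone do not separate the two cases; only the timing and resource counters differ.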
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>>>
>>
>
