In our case the application has no checkpointing capabilities. For us, a reschedule just means running from the start on a new host.
So a checkpoint with a signal 9 should be enough?

On 4 April 2012 17:50, Reuti <[email protected]> wrote:
> On 04.04.2012 at 17:42, Lars van der bijl wrote:
>
>> Hey Reuti
>>
>> On 4 April 2012 17:14, Reuti <[email protected]> wrote:
>>> Well, in both cases it is killed of course. You could set loglevel to
>>> log_info and search the messages file of the qmaster for entries like:
>>>
>>> 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370
>>> rescheduling because: manual/auto rescheduling
>>> 04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
>>> 04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
>>
>> might have to rotate the file before i try and do something like that,
>> it's currently 117Mb.
>>
>>>
>>> Then you can act on this. Do you have this often, that you want to
>>> reschedule a job? I wonder whether using a checkpointing environment would
>>> help (also if we don't intend to use any checkpointing at all). There you
>>> can have a procedure for migration in migr_command.
>>
>> no it's not something I want to happen often but it happens. one thing
>> i'm still struggling with on a related note is that a task will keep
>> running even after it is rescheduled. making both of the outputs
>> useless.
>>
>> would we be able to make sure the task is kill -9'd (and it's sub
>
> The default behavior in SGE is:
>
> # kill -9 -- -pid
>
> This will kill the complete process group due to its negative value. The
> problem of surviving kids should have been fixed since 6.2u3 as I found
> recently but sometimes it's still there.
>
>
>> pids) if it's rescheduled using a checkpointing?
>
> In fact: you have to do it on your own. SGE will start the migr_command and
> you have to checkpoint by any means and then kill all processes on your own.
> You can have a look at my Howto:
>
> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
>
> and example5 therein. To reschedule a job would then mean to suspend it from
> the command line which will start the migr_command.
>
> -- Reuti
>
>
>>> -- Reuti
>>>
>>>
>>> On 04.04.2012 at 16:33, Lars van der bijl wrote:
>>>
>>>> is there a way to tell the difference?
>>>>
>>>> if i reschedual a job i get these values in the usage file in the epilog
>>>>
>>>> wait_status=3727362
>>>> exit_status=137
>>>> signal=9
>>>> start_time=1333549517
>>>> end_time=1333549565
>>>> ru_wallclock=48
>>>> ru_utime=0.226965
>>>> ru_stime=0.306953
>>>> ru_maxrss=5408
>>>> ru_ixrss=0
>>>> ru_idrss=0
>>>> ru_isrss=0
>>>> ru_minflt=40792
>>>> ru_majflt=5
>>>> ru_nswap=0
>>>> ru_inblock=7992
>>>> ru_oublock=232
>>>> ru_msgsnd=0
>>>> ru_msgrcv=0
>>>> ru_nsignals=0
>>>> ru_nvcsw=3489
>>>> ru_nivcsw=113
>>>>
>>>> if i kill the job I get this.
>>>>
>>>> wait_status=3727362
>>>> exit_status=137
>>>> signal=9
>>>> start_time=1333549704
>>>> end_time=1333549719
>>>> ru_wallclock=15
>>>> ru_utime=0.196970
>>>> ru_stime=0.196970
>>>> ru_maxrss=5412
>>>> ru_ixrss=0
>>>> ru_idrss=0
>>>> ru_isrss=0
>>>> ru_minflt=40459
>>>> ru_majflt=0
>>>> ru_nswap=0
>>>> ru_inblock=0
>>>> ru_oublock=232
>>>> ru_msgsnd=0
>>>> ru_msgrcv=0
>>>> ru_nsignals=0
>>>> ru_nvcsw=705
>>>> ru_nivcsw=149
>>>>
>>>> anyone know of a way to tell the difference from the epilog?
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>>>
>>
>
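
Since both usage dumps in the thread show the same wait_status, exit_status and signal, the qmaster messages entries quoted by Reuti are what actually distinguish a reschedule from a delete. Below is a minimal Python sketch of scanning such entries. The regular expressions are assumptions derived only from the three sample lines in the thread, and `classify` is a hypothetical helper name, not part of SGE. The job id in the last sample line is truncated in the original mail and is kept as-is.

```python
import re

# Sample entries copied from the thread. The first entry was wrapped over two
# lines in the mail and is joined here; the last job id is truncated in the
# original and left untouched.
SAMPLE = """\
04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 rescheduling because: manual/auto rescheduling
04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
"""

# Assumed patterns, inferred from the sample lines only.
RESCHEDULED = re.compile(r"\|rescheduling job (\d+)(?:\.(\d+))?\s*$")
DELETED = re.compile(r"\|([^|\s]+) has deleted job (\d+)")


def classify(lines):
    """Return (event, job_id) tuples for reschedule and delete entries."""
    events = []
    for line in lines:
        match = RESCHEDULED.search(line)
        if match:
            events.append(("rescheduled", match.group(1)))
            continue
        match = DELETED.search(line)
        if match:
            events.append(("deleted", match.group(2)))
    return events


events = classify(SAMPLE.splitlines())
for event, job_id in events:
    print(event, job_id)
```

In practice the lines would be read from the qmaster messages file rather than from a string; its path depends on the installation, and loglevel has to be set to log_info for the delete entry to be written, as noted in the thread.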
