On 04.04.2012 at 17:42, Lars van der bijl wrote:

> Hey Reuti
>
> On 4 April 2012 17:14, Reuti <[email protected]> wrote:
>> Well, in both cases it is killed of course. You could set loglevel to
>> log_info and search the messages file of the qmaster for entries like:
>>
>> 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 rescheduling because: manual/auto rescheduling
>> 04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
>> 04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
>
> might have to rotate the file before i try and do something like that,
> it's currently 117Mb.
>
>>
>> Then you can act on this. Do you have this often, that you want to
>> reschedule a job? I wonder whether using a checkpointing environment would
>> help (also if we don't intend to use any checkpointing at all). There you
>> can have a procedure for migration in migr_command.
>
> no it's not something I want to happen often but it happens. one thing
> i'm still struggling with on a related note is that a task will keep
> running even after it is rescheduled. making both of the outputs
> useless.
>
> would we be able to make sure the task is kill -9'd (and its sub

The default behavior in SGE is:

# kill -9 -- -pid

This will kill the complete process group, because the pid argument is
negative. The problem of surviving child processes should have been fixed
since 6.2u3, as I found recently, but sometimes it's still there.

> pids) if it's rescheduled using a checkpointing?

In fact: you have to do it on your own. SGE will start the migr_command,
and you have to checkpoint by any means and then kill all processes on
your own. You can have a look at my Howto:

http://arc.liv.ac.uk/SGE/howto/checkpointing.html

and example5 therein. Rescheduling a job would then mean suspending it
from the command line, which will start the migr_command.

-- Reuti

>> -- Reuti
>>
>>
>> On 04.04.2012 at 16:33, Lars van der bijl wrote:
>>
>>> is there a way to tell the difference?
>>>
>>> if i reschedule a job i get these values in the usage file in the epilog
>>>
>>> wait_status=3727362
>>> exit_status=137
>>> signal=9
>>> start_time=1333549517
>>> end_time=1333549565
>>> ru_wallclock=48
>>> ru_utime=0.226965
>>> ru_stime=0.306953
>>> ru_maxrss=5408
>>> ru_ixrss=0
>>> ru_idrss=0
>>> ru_isrss=0
>>> ru_minflt=40792
>>> ru_majflt=5
>>> ru_nswap=0
>>> ru_inblock=7992
>>> ru_oublock=232
>>> ru_msgsnd=0
>>> ru_msgrcv=0
>>> ru_nsignals=0
>>> ru_nvcsw=3489
>>> ru_nivcsw=113
>>>
>>> if i kill the job I get this.
>>>
>>> wait_status=3727362
>>> exit_status=137
>>> signal=9
>>> start_time=1333549704
>>> end_time=1333549719
>>> ru_wallclock=15
>>> ru_utime=0.196970
>>> ru_stime=0.196970
>>> ru_maxrss=5412
>>> ru_ixrss=0
>>> ru_idrss=0
>>> ru_isrss=0
>>> ru_minflt=40459
>>> ru_majflt=0
>>> ru_nswap=0
>>> ru_inblock=0
>>> ru_oublock=232
>>> ru_msgsnd=0
>>> ru_msgrcv=0
>>> ru_nsignals=0
>>> ru_nvcsw=705
>>> ru_nivcsw=149
>>>
>>> anyone know of a way to tell the difference from the epilog?
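
Since both usage files above are identical in exit_status and signal, the only distinguishing information mentioned in this thread is the qmaster messages file. As a rough sketch of that log-based approach (not an SGE API): the patterns below are modelled only on the three log lines quoted in this thread, the function name `classify` is made up for illustration, and the exact wording of the entries may differ between SGE versions.

```python
import re

# Patterns modelled on the qmaster messages entries quoted in this thread.
# Assumed field layout: "date time|thread|host|level|text".
RESCHEDULE_RE = re.compile(r"rescheduling job (\d+)\.(\d+)")
DELETE_RE = re.compile(r"(\S+) has deleted job (\d+)")


def classify(lines, job_id):
    """Return 'rescheduled', 'deleted' or 'unknown' for job_id.

    The last matching entry wins, because a job can be rescheduled
    first and deleted later.
    """
    result = "unknown"
    for line in lines:
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # not a messages-style line, skip it
        text = parts[4]
        match = RESCHEDULE_RE.search(text)
        if match and int(match.group(1)) == job_id:
            result = "rescheduled"
            continue
        match = DELETE_RE.search(text)
        if match and int(match.group(2)) == job_id:
            result = "deleted"
    return result


if __name__ == "__main__":
    # Hypothetical sample input; in practice the lines would be read
    # from the qmaster messages file with loglevel set to log_info.
    sample = [
        "04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1",
    ]
    print(classify(sample, 3963))
```

An epilog could call something like this with the job number it receives, provided the messages file is readable from the execution host, which is an assumption about the local setup.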
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
