On 16.10.2012 at 19:50, Orion Poplawski wrote:

> With some more testing I'm seeing one major issue with dmtcp migration that
> involves migrating between different processor versions (e.g. Xeon 5400 ->
> 5500). We're running code compiled with the Intel Fortran compiler that is
> compiled with different code paths for different processors. This appears to
> be detected once at startup because if a job is migrated from an older
> processor to a newer processor the job will die with an illegal instruction
> signal.
I would assume that the compiled application detects the CPU type it's running on only once. You could probably limit this during compilation, by compiling only for the 5400 or older.

> There does not appear to be a way to restrict the migration of a job beyond
> what was already specified in the job submission, correct? I wonder if it
> would be possible to put more restrictions on a migrating job somehow.

You can use `qalter` in the migration method to request a hostgroup, or to request a string for the CPU type, for the job before it is killed.

-- Reuti

> One other little thing, I'm seeing this in the qmaster logs:
>
> 10/16/2012 11:24:20|worker|vulcan|W|job 27797.340 failed on host font1lin.cora.nwra.com migrating because: <unknown reason>
>
> But watching the trace file shows no problems:
>
> ==> trace <==
> 10/16/2012 11:24:03 [998:2247]: wait3 returned -1
> 10/16/2012 11:24:03 [998:2247]: initiate checkpoint due to migration request
> 10/16/2012 11:24:03 [998:7107]: starting migrate command: /usr/share/gridengine/util/dmtcp_migrate
> 10/16/2012 11:24:03 [614:7107]: start_as_command: pre_args_ptr[0] = argv0; "/usr/share/gridengine/util/dmtcp_migrate" shell_path = "/bin/sh"
> 10/16/2012 11:24:03 [614:7107]: execvp(/bin/sh, "/usr/share/gridengine/util/dmtcp_migrate" "-c" "/usr/share/gridengine/util/dmtcp_migrate")
>
> 10/16/2012 11:24:20 [998:2247]: wait3 returned 7107 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> 10/16/2012 11:24:20 [998:2247]: if jobs and shepherd do not exit there is some error in the migrate command
> 10/16/2012 11:24:20 [998:2247]: reaped migration checkpoint command
> 10/16/2012 11:24:20 [998:2247]: checkpoint command exited normally
> 10/16/2012 11:24:20 [998:2247]: checkpoint is in the arena after migration request
> 10/16/2012 11:24:20 [998:2247]: wait3 returned 2269 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> 10/16/2012 11:24:20 [998:2247]: job exited with exit status 0
> 10/16/2012 11:24:20 [998:2247]: reaped "job" with pid 2269
> 10/16/2012 11:24:20 [998:2247]: job exited not due to signal
> 10/16/2012 11:24:20 [998:2247]: job exited with status 0
> 10/16/2012 11:24:20 [998:2247]: now sending signal KILL to pid -2269
> 10/16/2012 11:24:20 [998:2247]: writing usage file to "usage"
> 10/16/2012 11:24:20 [998:2247]: no tasker to notify
> 10/16/2012 11:24:20 [998:7119]: child: starting son(epilog, /usr/share/gridengine/util/epilog, 0);
> 10/16/2012 11:24:20 [998:2247]: parent: forked "epilog" with pid 7119
> 10/16/2012 11:24:20 [998:2247]: no need to map signal TTOU
> 10/16/2012 11:24:20 [998:2247]: queued signal TTOU
> 10/16/2012 11:24:20 [998:2247]: using signal delivery delay of 120 seconds
> 10/16/2012 11:24:20 [998:2247]: parent: epilog-pid: 7119
> 10/16/2012 11:24:20 [998:7119]: pid=7119 pgrp=7119 sid=7119 old pgrp=2247 getlogin()=root
> 10/16/2012 11:24:20 [998:7119]: reading passwd information for user 'graham'
> 10/16/2012 11:24:20 [998:7119]: setting limits
> 10/16/2012 11:24:20 [998:7119]: setting environment
> 10/16/2012 11:24:20 [998:7119]: Initializing error file
> 10/16/2012 11:24:20 [998:7119]: switching to intermediate/target user
> 10/16/2012 11:24:20 [614:7119]: closing all filedescriptors
> 10/16/2012 11:24:20 [614:7119]: further messages are in "error" and "trace"
> 10/16/2012 11:24:20 [614:7119]: using "/bin/tcsh" as shell of user "graham"
> 10/16/2012 11:24:20 [614:7119]: now running with uid=614, euid=614
> 10/16/2012 11:24:20 [614:7119]: execvp(/usr/share/gridengine/util/epilog, "/usr/share/gridengine/util/epilog")
> 10/16/2012 11:24:20 [998:2247]: wait3 returned 7119 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> 10/16/2012 11:24:20 [998:2247]: epilog exited with exit status 0
> 10/16/2012 11:24:20 [998:2247]: reaped "epilog" with pid 7119
> 10/16/2012 11:24:20 [998:2247]: epilog exited not due to signal
> 10/16/2012 11:24:20 [998:2247]: epilog exited with status 0
> 10/16/2012 11:24:20 [998:2247]: sending SIGTERM to sge_coshepherd
>
> ==> exit_status <==
> 0
>
> Seems like a confusing message.
>
> --
> Orion Poplawski
> Technical Manager 303-415-9701 x222
> NWRA, Boulder Office FAX: 303-415-9702
> 3380 Mitchell Lane [email protected]
> Boulder, CO 80301 http://www.nwra.com
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
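
For illustration, the `qalter` idea from the reply could be sketched roughly as below. This is only a sketch under assumptions, not what dmtcp_migrate actually does: the hostgroup names `@xeon5400` and `@xeon5500`, the helper functions and the `QALTER` override are all made up for the example, and it assumes the execution host is permitted to run `qalter`. Note that `-q` replaces whatever queue request the job already had.

```shell
#!/bin/sh
# Sketch only: pin a job to hosts of the CPU generation it started on.
# The hostgroups @xeon5400 / @xeon5500 are hypothetical; they would have
# to be created by the admin (e.g. with qconf -ahgrp).

# Map a /proc/cpuinfo "model name" string to a (hypothetical) hostgroup.
# Prints nothing for CPU models this sketch does not know about.
cpu_hostgroup() {
    case "$1" in
        *[ELXW]54[0-9][0-9]*) echo "@xeon5400" ;;
        *[ELXW]55[0-9][0-9]*) echo "@xeon5500" ;;
        *) : ;;
    esac
}

# restrict_job JOB_ID MODEL_NAME
# Replaces the job's hard queue request with "any queue on that hostgroup".
# QALTER can be overridden (e.g. QALTER=echo) for a dry run.
restrict_job() {
    hg=$(cpu_hostgroup "$2")
    # Unknown CPU model: leave the job's requests untouched.
    [ -n "$hg" ] || return 0
    ${QALTER:-qalter} -q "*@$hg" "$1"
}

# Possible call from a migration script, before the job is killed
# (how the script learns the job id depends on the local setup):
# model=$(sed -n 's/^model name[^:]*: //p' /proc/cpuinfo | head -n 1)
# restrict_job "$JOB_ID" "$model"
```

Running it with `QALTER=echo` first shows the command that would be issued without touching the job. Requesting a string for the CPU type with `-l` would work along the same lines, but `-l` likewise replaces the job's existing resource requests, so those would need to be repeated in the call.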
