After some more testing I'm seeing one major issue with DMTCP migration: moving
a job between different processor generations (e.g. Xeon 5400 -> 5500). We're
running code built with the Intel Fortran compiler, which generates different
code paths for different processors. The processor type appears to be detected
only once, at startup, because a job that is migrated from an older processor
to a newer one dies with an illegal instruction signal.

There does not appear to be a way to restrict where a job may migrate beyond
what was already specified in the job submission, correct? I wonder whether it
would be possible to place additional restrictions on a migrating job.

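A rough sketch of the kind of compatibility check such a restriction would need. This is not an existing Grid Engine or DMTCP feature; the function names `parse_cpuinfo` and `same_cpu_features` are hypothetical, and the sample flag lists are abbreviated for illustration. It assumes Linux-style /proc/cpuinfo text collected from both hosts, and uses the most conservative rule (identical CPU flag sets), since the compiler's runtime dispatch has already committed to one code path.

```python
# Hypothetical helper, not part of Grid Engine or DMTCP: compare the CPU
# feature flags of the host a job started on with a migration candidate.

def parse_cpuinfo(text):
    """Return (model name, set of flags) for the first CPU in cpuinfo text."""
    model, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU entries
        key = key.strip()
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return model, flags


def same_cpu_features(origin_text, target_text):
    """Conservative rule: allow migration only if the flag sets are identical."""
    _, origin_flags = parse_cpuinfo(origin_text)
    _, target_flags = parse_cpuinfo(target_text)
    # An empty flag set means the text could not be parsed; refuse in that case.
    return bool(origin_flags) and origin_flags == target_flags


if __name__ == "__main__":
    # Abbreviated, illustrative cpuinfo excerpts (real flag lists are longer).
    older = "model name\t: older CPU\nflags\t\t: fpu sse2 ssse3 sse4_1\n"
    newer = "model name\t: newer CPU\nflags\t\t: fpu sse2 ssse3 sse4_1 sse4_2 popcnt\n"
    print(same_cpu_features(older, older))
    print(same_cpu_features(older, newer))
```

Requiring identical flags is stricter than necessary in some cases, but it avoids guessing which instruction sets the running binary selected at startup.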
One other minor thing: I'm seeing this in the qmaster log:

10/16/2012 11:24:20|worker|vulcan|W|job 27797.340 failed on host font1lin.cora.nwra.com migrating because: <unknown reason>

But the trace file shows no problems:

==> trace <==
10/16/2012 11:24:03 [998:2247]: wait3 returned -1
10/16/2012 11:24:03 [998:2247]: initiate checkpoint due to migration request
10/16/2012 11:24:03 [998:7107]: starting migrate command: /usr/share/gridengine/util/dmtcp_migrate
10/16/2012 11:24:03 [614:7107]: start_as_command: pre_args_ptr[0] = argv0; "/usr/share/gridengine/util/dmtcp_migrate" shell_path = "/bin/sh"
10/16/2012 11:24:03 [614:7107]: execvp(/bin/sh, "/usr/share/gridengine/util/dmtcp_migrate" "-c" "/usr/share/gridengine/util/dmtcp_migrate")
10/16/2012 11:24:20 [998:2247]: wait3 returned 7107 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
10/16/2012 11:24:20 [998:2247]: if jobs and shepherd do not exit there is some error in the migrate command
10/16/2012 11:24:20 [998:2247]: reaped migration checkpoint command
10/16/2012 11:24:20 [998:2247]: checkpoint command exited normally
10/16/2012 11:24:20 [998:2247]: checkpoint is in the arena after migration request
10/16/2012 11:24:20 [998:2247]: wait3 returned 2269 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
10/16/2012 11:24:20 [998:2247]: job exited with exit status 0
10/16/2012 11:24:20 [998:2247]: reaped "job" with pid 2269
10/16/2012 11:24:20 [998:2247]: job exited not due to signal
10/16/2012 11:24:20 [998:2247]: job exited with status 0
10/16/2012 11:24:20 [998:2247]: now sending signal KILL to pid -2269
10/16/2012 11:24:20 [998:2247]: writing usage file to "usage"
10/16/2012 11:24:20 [998:2247]: no tasker to notify
10/16/2012 11:24:20 [998:7119]: child: starting son(epilog, /usr/share/gridengine/util/epilog, 0);
10/16/2012 11:24:20 [998:2247]: parent: forked "epilog" with pid 7119
10/16/2012 11:24:20 [998:2247]: no need to map signal TTOU
10/16/2012 11:24:20 [998:2247]: queued signal TTOU
10/16/2012 11:24:20 [998:2247]: using signal delivery delay of 120 seconds
10/16/2012 11:24:20 [998:2247]: parent: epilog-pid: 7119
10/16/2012 11:24:20 [998:7119]: pid=7119 pgrp=7119 sid=7119 old pgrp=2247 getlogin()=root
10/16/2012 11:24:20 [998:7119]: reading passwd information for user 'graham'
10/16/2012 11:24:20 [998:7119]: setting limits
10/16/2012 11:24:20 [998:7119]: setting environment
10/16/2012 11:24:20 [998:7119]: Initializing error file
10/16/2012 11:24:20 [998:7119]: switching to intermediate/target user
10/16/2012 11:24:20 [614:7119]: closing all filedescriptors
10/16/2012 11:24:20 [614:7119]: further messages are in "error" and "trace"
10/16/2012 11:24:20 [614:7119]: using "/bin/tcsh" as shell of user "graham"
10/16/2012 11:24:20 [614:7119]: now running with uid=614, euid=614
10/16/2012 11:24:20 [614:7119]: execvp(/usr/share/gridengine/util/epilog, "/usr/share/gridengine/util/epilog")
10/16/2012 11:24:20 [998:2247]: wait3 returned 7119 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
10/16/2012 11:24:20 [998:2247]: epilog exited with exit status 0
10/16/2012 11:24:20 [998:2247]: reaped "epilog" with pid 7119
10/16/2012 11:24:20 [998:2247]: epilog exited not due to signal
10/16/2012 11:24:20 [998:2247]: epilog exited with status 0
10/16/2012 11:24:20 [998:2247]: sending SIGTERM to sge_coshepherd
==> exit_status <==
0
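
The migrate command, the job, and the epilog all exited with status 0, so the
qmaster warning about a failure for an unknown reason seems confusing.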
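
For anyone wanting to double-check a trace like the one above, here is a minimal sketch that pulls the child exit information out of the "wait3 returned" lines. The helper names `child_exits` and `abnormal_exits` are made up for illustration, and the pattern only assumes the line format visible in this trace; other Grid Engine versions may print it differently.

```python
import re

# Matches trace entries of the form seen above, e.g.
#   wait3 returned 7107 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
# Entries such as "wait3 returned -1" carry no status and are skipped.
WAIT3_RE = re.compile(
    r"wait3 returned (\d+) \(status: (-?\d+); WIFSIGNALED: (\d+), "
    r"WIFEXITED: (\d+), WEXITSTATUS: (\d+)\)"
)


def child_exits(trace_text):
    """Return one dict per reaped child found in the trace text."""
    # Long trace lines may have been wrapped (e.g. by a mail client),
    # so collapse all whitespace before matching.
    flat = " ".join(trace_text.split())
    return [
        {
            "pid": int(pid),
            "signaled": signaled != "0",
            "exited": exited != "0",
            "exit_status": int(code),
        }
        for pid, _status, signaled, exited, code in WAIT3_RE.findall(flat)
    ]


def abnormal_exits(trace_text):
    """Children that were killed by a signal or returned a non-zero status."""
    return [
        c for c in child_exits(trace_text)
        if c["signaled"] or c["exit_status"] != 0
    ]
```

An empty result from `abnormal_exits` means every child the shepherd reaped reported a normal exit, which is what makes the qmaster warning look inconsistent with the trace.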
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com