With some more testing, I'm seeing one major issue with DMTCP migration: moving a job between different processor versions (e.g. Xeon 5400 -> 5500). We're running code built with the Intel Fortran compiler, with different code paths for different processors. The code path appears to be selected once at startup, because if a job is migrated from an older processor to a newer processor, the job dies with an illegal instruction signal.

There does not appear to be a way to restrict the migration of a job beyond what was already specified in the job submission, correct? I wonder if it would be possible to put more restrictions on a migrating job somehow.
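
As a rough sketch of the kind of pre-migration check this would need, assuming Linux hosts where each node's CPU features can be read from the "flags" line of /proc/cpuinfo: the function names below are hypothetical and nothing here is part of Grid Engine or DMTCP. It treats any difference in feature flags as unsafe, since the failure shows up even when moving to a newer processor.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No "flags" line found (assumption: treat as an empty feature set).
    return set()


def flag_differences(source_text, target_text):
    """Return (flags only on the source host, flags only on the target host)."""
    source = parse_cpu_flags(source_text)
    target = parse_cpu_flags(target_text)
    return source - target, target - source


def migration_looks_safe(source_text, target_text):
    """Conservative check: require identical feature flags on both hosts."""
    only_source, only_target = flag_differences(source_text, target_text)
    return not only_source and not only_target


if __name__ == "__main__":
    # Hypothetical, abbreviated flag lines for illustration only.
    older = "model name : CPU A\nflags : fpu sse sse2 ssse3 sse4_1\n"
    newer = "model name : CPU B\nflags : fpu sse sse2 ssse3 sse4_1 sse4_2 popcnt\n"
    print(flag_differences(older, newer))
    print(migration_looks_safe(older, newer))
```

In practice the two inputs would be the contents of /proc/cpuinfo collected on the source and candidate target hosts; how to feed the result back into the scheduler's host selection is the open question.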

One other little thing, I'm seeing this in the qmaster logs:

10/16/2012 11:24:20|worker|vulcan|W|job 27797.340 failed on host font1lin.cora.nwra.com migrating because: <unknown reason>

But watching the trace file shows no problems:

==> trace <==
10/16/2012 11:24:03 [998:2247]: wait3 returned -1
10/16/2012 11:24:03 [998:2247]: initiate checkpoint due to migration request
10/16/2012 11:24:03 [998:7107]: starting migrate command: /usr/share/gridengine/util/dmtcp_migrate
10/16/2012 11:24:03 [614:7107]: start_as_command: pre_args_ptr[0] = argv0; "/usr/share/gridengine/util/dmtcp_migrate" shell_path = "/bin/sh"
10/16/2012 11:24:03 [614:7107]: execvp(/bin/sh, "/usr/share/gridengine/util/dmtcp_migrate" "-c" "/usr/share/gridengine/util/dmtcp_migrate")

10/16/2012 11:24:20 [998:2247]: wait3 returned 7107 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
10/16/2012 11:24:20 [998:2247]: if jobs and shepherd do not exit there is some error in the migrate command
10/16/2012 11:24:20 [998:2247]: reaped migration checkpoint command
10/16/2012 11:24:20 [998:2247]: checkpoint command exited normally
10/16/2012 11:24:20 [998:2247]: checkpoint is in the arena after migration request
10/16/2012 11:24:20 [998:2247]: wait3 returned 2269 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
10/16/2012 11:24:20 [998:2247]: job exited with exit status 0
10/16/2012 11:24:20 [998:2247]: reaped "job" with pid 2269
10/16/2012 11:24:20 [998:2247]: job exited not due to signal
10/16/2012 11:24:20 [998:2247]: job exited with status 0
10/16/2012 11:24:20 [998:2247]: now sending signal KILL to pid -2269
10/16/2012 11:24:20 [998:2247]: writing usage file to "usage"
10/16/2012 11:24:20 [998:2247]: no tasker to notify
10/16/2012 11:24:20 [998:7119]: child: starting son(epilog, /usr/share/gridengine/util/epilog, 0);
10/16/2012 11:24:20 [998:2247]: parent: forked "epilog" with pid 7119
10/16/2012 11:24:20 [998:2247]: no need to map signal TTOU
10/16/2012 11:24:20 [998:2247]: queued signal TTOU
10/16/2012 11:24:20 [998:2247]: using signal delivery delay of 120 seconds
10/16/2012 11:24:20 [998:2247]: parent: epilog-pid: 7119
10/16/2012 11:24:20 [998:7119]: pid=7119 pgrp=7119 sid=7119 old pgrp=2247 getlogin()=root
10/16/2012 11:24:20 [998:7119]: reading passwd information for user 'graham'
10/16/2012 11:24:20 [998:7119]: setting limits
10/16/2012 11:24:20 [998:7119]: setting environment
10/16/2012 11:24:20 [998:7119]: Initializing error file
10/16/2012 11:24:20 [998:7119]: switching to intermediate/target user
10/16/2012 11:24:20 [614:7119]: closing all filedescriptors
10/16/2012 11:24:20 [614:7119]: further messages are in "error" and "trace"
10/16/2012 11:24:20 [614:7119]: using "/bin/tcsh" as shell of user "graham"
10/16/2012 11:24:20 [614:7119]: now running with uid=614, euid=614
10/16/2012 11:24:20 [614:7119]: execvp(/usr/share/gridengine/util/epilog, "/usr/share/gridengine/util/epilog")
10/16/2012 11:24:20 [998:2247]: wait3 returned 7119 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
10/16/2012 11:24:20 [998:2247]: epilog exited with exit status 0
10/16/2012 11:24:20 [998:2247]: reaped "epilog" with pid 7119
10/16/2012 11:24:20 [998:2247]: epilog exited not due to signal
10/16/2012 11:24:20 [998:2247]: epilog exited with status 0
10/16/2012 11:24:20 [998:2247]: sending SIGTERM to sge_coshepherd

==> exit_status <==
0

Since everything exited with status 0, the "failed ... because: <unknown reason>" warning seems like a confusing message.

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane                       [email protected]
Boulder, CO 80301                   http://www.nwra.com