[gridengine users] Problems with dmtcp migration between processor versions

Orion Poplawski Tue, 16 Oct 2012 10:52:26 -0700

With some more testing I'm seeing one major issue with dmtcp migration that involves migrating between different processor versions (e.g. Xeon 5400 -> 5500). We're running code compiled with the Intel Fortran compiler that is compiled with different code paths for different processors. This appears to be detected once at startup because if a job is migrated from an older processor to a newer processor the job will die with an illegal instruction signal.

There does not appear to be a way to restrict the migration of a job beyond what was already specified in the job submission, correct? I wonder if it would be possible to put more restrictions on a migrating job somehow.
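As a rough illustration of the kind of restriction I have in mind, here is a minimal Python sketch that compares the CPU feature flags of two hosts from the text of their /proc/cpuinfo. The function names parse_cpuinfo and same_cpu_features are hypothetical and nothing here is an existing Grid Engine or dmtcp feature. Since the failure above happened when moving to a newer processor, the sketch conservatively requires identical flag sets rather than a superset.

```python
def parse_cpuinfo(text):
    """Return (model_name, flags) from the first processor entry of /proc/cpuinfo text.

    Hypothetical helper for illustration; not part of Grid Engine or dmtcp.
    """
    model = None
    flags = frozenset()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line or a line without "key : value"
        key = key.strip()
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = frozenset(value.split())
    return model, flags


def same_cpu_features(source_text, target_text):
    """True only when both hosts report the same, non-empty set of CPU flags."""
    _, source_flags = parse_cpuinfo(source_text)
    _, target_flags = parse_cpuinfo(target_text)
    return bool(source_flags) and source_flags == target_flags
```

How the answer would be fed back into the scheduler is left open; that is exactly the part I don't see a hook for.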


One other little thing, I'm seeing this in the qmaster logs:

10/16/2012 11:24:20|worker|vulcan|W|job 27797.340 failed on host font1lin.cora.nwra.com migrating because: <unknown reason>


But watching the trace file shows no problems:

==> trace <==
10/16/2012 11:24:03 [998:2247]: wait3 returned -1
10/16/2012 11:24:03 [998:2247]: initiate checkpoint due to migration request

10/16/2012 11:24:03 [998:7107]: starting migrate command:/usr/share/gridengine/util/dmtcp_migrate
10/16/2012 11:24:03 [614:7107]: start_as_command: pre_args_ptr[0] = argv0; "/usr/share/gridengine/util/dmtcp_migrate" shell_path = "/bin/sh"
10/16/2012 11:24:03 [614:7107]: execvp(/bin/sh,"/usr/share/gridengine/util/dmtcp_migrate" "-c" "/usr/share/gridengine/util/dmtcp_migrate")

10/16/2012 11:24:20 [998:2247]: wait3 returned 7107 (status: 0; WIFSIGNALED:0, WIFEXITED: 1, WEXITSTATUS: 0)
10/16/2012 11:24:20 [998:2247]: if jobs and shepherd do not exit there is some error in the migrate command

10/16/2012 11:24:20 [998:2247]: reaped migration checkpoint command
10/16/2012 11:24:20 [998:2247]: checkpoint command exited normally
10/16/2012 11:24:20 [998:2247]: checkpoint is in the arena after migration request

10/16/2012 11:24:20 [998:2247]: wait3 returned 2269 (status: 0; WIFSIGNALED:0, WIFEXITED: 1, WEXITSTATUS: 0)

10/16/2012 11:24:20 [998:2247]: job exited with exit status 0
10/16/2012 11:24:20 [998:2247]: reaped "job" with pid 2269
10/16/2012 11:24:20 [998:2247]: job exited not due to signal
10/16/2012 11:24:20 [998:2247]: job exited with status 0
10/16/2012 11:24:20 [998:2247]: now sending signal KILL to pid -2269
10/16/2012 11:24:20 [998:2247]: writing usage file to "usage"
10/16/2012 11:24:20 [998:2247]: no tasker to notify

10/16/2012 11:24:20 [998:7119]: child: starting son(epilog,/usr/share/gridengine/util/epilog, 0);

10/16/2012 11:24:20 [998:2247]: parent: forked "epilog" with pid 7119
10/16/2012 11:24:20 [998:2247]: no need to map signal TTOU
10/16/2012 11:24:20 [998:2247]: queued signal TTOU
10/16/2012 11:24:20 [998:2247]: using signal delivery delay of 120 seconds
10/16/2012 11:24:20 [998:2247]: parent: epilog-pid: 7119

10/16/2012 11:24:20 [998:7119]: pid=7119 pgrp=7119 sid=7119 old pgrp=2247 getlogin()=root

10/16/2012 11:24:20 [998:7119]: reading passwd information for user 'graham'
10/16/2012 11:24:20 [998:7119]: setting limits
10/16/2012 11:24:20 [998:7119]: setting environment
10/16/2012 11:24:20 [998:7119]: Initializing error file
10/16/2012 11:24:20 [998:7119]: switching to intermediate/target user
10/16/2012 11:24:20 [614:7119]: closing all filedescriptors
10/16/2012 11:24:20 [614:7119]: further messages are in "error" and "trace"
10/16/2012 11:24:20 [614:7119]: using "/bin/tcsh" as shell of user "graham"
10/16/2012 11:24:20 [614:7119]: now running with uid=614, euid=614

10/16/2012 11:24:20 [614:7119]: execvp(/usr/share/gridengine/util/epilog,"/usr/share/gridengine/util/epilog")
10/16/2012 11:24:20 [998:2247]: wait3 returned 7119 (status: 0; WIFSIGNALED:0, WIFEXITED: 1, WEXITSTATUS: 0)

10/16/2012 11:24:20 [998:2247]: epilog exited with exit status 0
10/16/2012 11:24:20 [998:2247]: reaped "epilog" with pid 7119
10/16/2012 11:24:20 [998:2247]: epilog exited not due to signal
10/16/2012 11:24:20 [998:2247]: epilog exited with status 0
10/16/2012 11:24:20 [998:2247]: sending SIGTERM to sge_coshepherd

==> exit_status <==
0

Seems like a confusing message.
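To check a trace like the one above mechanically instead of by eye, a small Python sketch along these lines could list any child that the shepherd reaped with a signal or a nonzero exit code. The regular expression is written against the "wait3 returned" lines exactly as they appear in this trace; the name abnormal_exits is hypothetical, and other Grid Engine versions may format these lines differently.

```python
import re

# Matches lines such as:
#   wait3 returned 7107 (status: 0; WIFSIGNALED:0, WIFEXITED: 1, WEXITSTATUS: 0)
# "wait3 returned -1" carries no status block and is deliberately not matched.
WAIT3_RE = re.compile(
    r"wait3 returned (?P<pid>\d+) \(status: (?P<status>\d+); "
    r"WIFSIGNALED:\s*(?P<signaled>\d+), WIFEXITED:\s*(?P<exited>\d+), "
    r"WEXITSTATUS:\s*(?P<code>\d+)\)"
)


def abnormal_exits(trace_text):
    """Return (pid, was_signaled, exit_code) for each reaped child that did not exit cleanly."""
    bad = []
    for match in WAIT3_RE.finditer(trace_text):
        signaled = int(match["signaled"])
        code = int(match["code"])
        if signaled or code != 0:
            bad.append((int(match["pid"]), bool(signaled), code))
    return bad
```

Run over the trace above, this should return an empty list, which is what makes the "<unknown reason>" failure in the qmaster log look odd.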

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane                       [email protected]
Boulder, CO 80301                   http://www.nwra.com