Re: [gridengine users] Problems with dmtcp migration between processor versions

Reuti Tue, 16 Oct 2012 11:21:57 -0700

On 16.10.2012 at 19:50, Orion Poplawski wrote:

> With some more testing I'm seeing one major issue with dmtcp migration that 
> involves migrating between different processor versions (e.g. Xeon 5400 -> 
> 5500).  We're running code compiled with the Intel Fortran compiler that is 
> compiled with different code paths for different processors.  This appears to 
> be detected once at startup because if a job is migrated from an older 
> processor to a newer processor the job will die with an illegal instruction 
> signal.


I would assume that the compiled application checks only once, at startup, 
which CPU type it's running on. You could presumably avoid the problem by 
compiling only for the 5400 or older.
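
As a rough sketch of what that could look like with the Intel compiler: the 
source file name `model.f90` is hypothetical, and the target names assume a 
compiler version that accepts `SSE4.1` and `SSE4.2` (the 5400 series supports 
up to SSE4.1, the 5500 series adds SSE4.2). Check the options against your 
compiler's documentation before relying on them.

```shell
# Multiple code paths: a baseline path plus an SSE4.2 path that is picked
# by a CPU check at run time. A build like this may be what fails after a
# checkpoint is restarted on a different processor type.
ifort -O2 -axSSE4.2 -o model model.f90

# Single code path targeting SSE4.1, which should run on the 5400 series
# and on newer processors alike.
ifort -O2 -xSSE4.1 -o model model.f90
```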


> There does not appear to be a way to restrict the migration of a job beyond 
> what was already specified in the job submission, correct?  I wonder if it 
> would be possible to put more restrictions on a migrating job somehow.

You can use `qalter` in the migration method to request a hostgroup, or to 
request a string value for the CPU type, for the job before it is killed.
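
A minimal sketch of the hostgroup variant, under several assumptions: the 
hostgroups `@xeon5400` and `@xeon5500` and both helper functions are 
hypothetical, the migration script knows the job id (passed in here as an 
argument), and the execution host is allowed to run `qalter`. Note that 
`qalter -q` replaces any queue request the job already had.

```shell
#!/bin/sh
# Sketch only: the hostgroup names and the model-to-hostgroup mapping are
# hypothetical and must be adapted to the local cluster.

# Map a CPU "model name" string to a hostgroup of compatible hosts.
# Prints the hostgroup name, or returns 1 if the model is not recognised.
cpu_hostgroup() {
    case "$1" in
        *[ELX]54[0-9][0-9]*) echo "@xeon5400" ;;
        *[ELX]55[0-9][0-9]*) echo "@xeon5500" ;;
        *) return 1 ;;
    esac
}

# Restrict job $1 to queue instances on hosts in the hostgroup that
# matches CPU model string $2. Set QALTER=echo for a dry run.
pin_job_to_cpu() {
    group=$(cpu_hostgroup "$2") || return 1
    ${QALTER:-qalter} -q "*@${group}" "$1"
}
```

In the migration script this could be called before the checkpoint is taken, 
for example with the first `model name` entry from `/proc/cpuinfo` on the 
current host: `model=$(sed -n 's/^model name[^:]*: *//p' /proc/cpuinfo | head -n 1)` 
followed by `pin_job_to_cpu "$job_id" "$model"`, where `$job_id` stands for 
however the script obtains the job id.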

-- Reuti


> One other little thing, I'm seeing this in the qmaster logs:
> 
> 10/16/2012 11:24:20|worker|vulcan|W|job 27797.340 failed on host 
> font1lin.cora.nwra.com migrating because: <unknown reason>
> 
> But watching the trace file shows no problems:
> 
> ==> trace <==
> 10/16/2012 11:24:03 [998:2247]: wait3 returned -1
> 10/16/2012 11:24:03 [998:2247]: initiate checkpoint due to migration request
> 10/16/2012 11:24:03 [998:7107]: starting migrate command: 
> /usr/share/gridengine/util/dmtcp_migrate
> 10/16/2012 11:24:03 [614:7107]: start_as_command: pre_args_ptr[0] = argv0; 
> "/usr/share/gridengine/util/dmtcp_migrate" shell_path = "/bin/sh"
> 10/16/2012 11:24:03 [614:7107]: execvp(/bin/sh, 
> "/usr/share/gridengine/util/dmtcp_migrate" "-c" 
> "/usr/share/gridengine/util/dmtcp_migrate")
> 
> 10/16/2012 11:24:20 [998:2247]: wait3 returned 7107 (status: 0; WIFSIGNALED: 
> 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 10/16/2012 11:24:20 [998:2247]: if jobs and shepherd do not exit there is 
> some error in the migrate command
> 10/16/2012 11:24:20 [998:2247]: reaped migration checkpoint command
> 10/16/2012 11:24:20 [998:2247]: checkpoint command exited normally
> 10/16/2012 11:24:20 [998:2247]: checkpoint is in the arena after migration 
> request
> 10/16/2012 11:24:20 [998:2247]: wait3 returned 2269 (status: 0; WIFSIGNALED: 
> 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 10/16/2012 11:24:20 [998:2247]: job exited with exit status 0
> 10/16/2012 11:24:20 [998:2247]: reaped "job" with pid 2269
> 10/16/2012 11:24:20 [998:2247]: job exited not due to signal
> 10/16/2012 11:24:20 [998:2247]: job exited with status 0
> 10/16/2012 11:24:20 [998:2247]: now sending signal KILL to pid -2269
> 10/16/2012 11:24:20 [998:2247]: writing usage file to "usage"
> 10/16/2012 11:24:20 [998:2247]: no tasker to notify
> 10/16/2012 11:24:20 [998:7119]: child: starting son(epilog, 
> /usr/share/gridengine/util/epilog, 0);
> 10/16/2012 11:24:20 [998:2247]: parent: forked "epilog" with pid 7119
> 10/16/2012 11:24:20 [998:2247]: no need to map signal TTOU
> 10/16/2012 11:24:20 [998:2247]: queued signal TTOU
> 10/16/2012 11:24:20 [998:2247]: using signal delivery delay of 120 seconds
> 10/16/2012 11:24:20 [998:2247]: parent: epilog-pid: 7119
> 10/16/2012 11:24:20 [998:7119]: pid=7119 pgrp=7119 sid=7119 old pgrp=2247 
> getlogin()=root
> 10/16/2012 11:24:20 [998:7119]: reading passwd information for user 'graham'
> 10/16/2012 11:24:20 [998:7119]: setting limits
> 10/16/2012 11:24:20 [998:7119]: setting environment
> 10/16/2012 11:24:20 [998:7119]: Initializing error file
> 10/16/2012 11:24:20 [998:7119]: switching to intermediate/target user
> 10/16/2012 11:24:20 [614:7119]: closing all filedescriptors
> 10/16/2012 11:24:20 [614:7119]: further messages are in "error" and "trace"
> 10/16/2012 11:24:20 [614:7119]: using "/bin/tcsh" as shell of user "graham"
> 10/16/2012 11:24:20 [614:7119]: now running with uid=614, euid=614
> 10/16/2012 11:24:20 [614:7119]: execvp(/usr/share/gridengine/util/epilog, 
> "/usr/share/gridengine/util/epilog")
> 10/16/2012 11:24:20 [998:2247]: wait3 returned 7119 (status: 0; WIFSIGNALED: 
> 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 10/16/2012 11:24:20 [998:2247]: epilog exited with exit status 0
> 10/16/2012 11:24:20 [998:2247]: reaped "epilog" with pid 7119
> 10/16/2012 11:24:20 [998:2247]: epilog exited not due to signal
> 10/16/2012 11:24:20 [998:2247]: epilog exited with status 0
> 10/16/2012 11:24:20 [998:2247]: sending SIGTERM to sge_coshepherd
> 
> ==> exit_status <==
> 0
> 
> Seems like a confusing message.
> 
> -- 
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder Office                  FAX: 303-415-9702
> 3380 Mitchell Lane                       [email protected]
> Boulder, CO 80301                   http://www.nwra.com
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


