Sebastian,
the PSM2 shared memory segment name is set by the PSM2 library and
my understanding is that Open MPI has no control over it.
If you believe the root cause of the crash is a non-unique PSM2 shared
memory segment name, I suggest reporting it at
https://github.com/intel/opa-psm2
Below is a snippet from ptl_am/am_reqrep_shmem.c
Cheers,
Gilles
psm2_error_t psmi_shm_create(ptl_t *ptl_gen)
{
    // ...
    snprintf(shmbuf,
             sizeof(shmbuf),
             "/psm2_shm.%ld%016lx%d",
             (long int) getuid(),
             ep->epid,
             iterator);
    amsh_keyname = psmi_strdup(NULL, shmbuf);
    // ...
    shmfd =
        shm_open(amsh_keyname, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    // ...
}
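To make the naming scheme concrete, here is a minimal sketch that rebuilds the name with the same format string as the snippet above. The helper format_psm2_shm_name is hypothetical and not part of PSM2; it assumes the epid is a 64-bit value and uses PRIx64 in place of the original %016lx, which is equivalent on LP64 Linux.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Hypothetical helper, NOT part of PSM2: mirrors the snprintf() call in
 * psmi_shm_create(). The name is built only from the user id, the
 * endpoint id (epid) and an iterator; no job id or PID is involved. */
static int format_psm2_shm_name(char *buf, size_t len,
                                long uid, uint64_t epid, int iterator)
{
    /* "%016" PRIx64 pads the epid to 16 hex digits, like "%016lx". */
    return snprintf(buf, len, "/psm2_shm.%ld%016" PRIx64 "%d",
                    uid, epid, iterator);
}
```

With this format, two processes that run under the same uid and end up with the same epid and iterator value produce the same name, regardless of which SLURM job they belong to. Whether epids can actually repeat across jobs on one node is a question for the PSM2 developers.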
On 7/5/2019 4:13 AM, Kraus, Sebastian via users wrote:
Hi all,
is there anyone who could explain to me how the names of the PSM2
and Vader shared memory segments are constructed?
I am curious whether the naming scheme can be influenced via
run-time parameters. I am facing a situation where distinct
SLURM jobs of the same user on the same node randomly segfault. I suspect that
the problem is connected with the non-unique naming
scheme of the PSM2 shared memory segments (as determined by Open MPI/SLURM).
The PSM2 segments show the following naming convention:
/dev/shm/psm2_shm.[user_id][some_mask]
Unfortunately, the values of the mask do not change across distinct SLURM jobs.
The names of the Vader segments, by contrast, are unique for
different process IDs:
/dev/shm/vader_segment.[nodename].[some_process-mask].[SLURM_STEPID]
An example:
Vader segments:
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.5
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.3
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.1
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.7
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.6
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.0
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.2
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93e00001.4
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.7
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.5
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.1
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.4
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.3
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.0
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.6
-rw------- 1 XXX YYY 4.1M Jul 4 19:09 /dev/shm/vader_segment.nodename.93650001.2
PSM2 segments:
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000007ff0000e00
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000006ff0000c00
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000005ff0000a00
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000003ff0000600
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000002ff0000400
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000001ff0000200
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000000ff0000000
-rw------- 1 XXX YYY 4.2M Jul 4 19:09 /dev/shm/psm2_shm.117648500000004ff0000800
Thanks for your time and support
Sebastian
Sebastian Kraus
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users