[slurm-users] Checkpoint

2021-11-29 Thread Alberto Morillas, Angelines
Hi! I need your help. How could I use checkpointing (DMTCP) with Slurm? Thanks in advance, Angelines

[slurm-users] xalloc

2021-03-09 Thread Alberto Morillas, Angelines
Hi, I need your help. I have users who need an interactive shell on a compute node, with the possibility of running programs with a graphical user interface directly on that node. While looking for information, I found the xalloc command, but it must be a wrapper because it isn't installed

Re: [slurm-users] Get original script of a job

2021-03-07 Thread Alberto Morillas, Angelines
edit your Subject line so it is more specific than "Re: Contents of slurm-users digest..." Today's Topics: 1. Re: slurm-users Digest, Vol 41, Issue 13 (Alberto Morillas, Angelines) 2. Re: Get original script of a job (Ward Poelmans) 3. Re: Get orig

Re: [slurm-users] slurm-users Digest, Vol 41, Issue 13

2021-03-05 Thread Alberto Morillas, Angelines
05-03-2021 11:29, Alberto Morillas, Angelines wrote: "I would like to know if it is possible to get the script that was used to submit a job. I know that after submitting a job I can get the path and name of its script with scontrol, but normally the us

[slurm-users] Get original script of a job

2021-03-05 Thread Alberto Morillas, Angelines
Hi, I would like to know if it is possible to get the script that was used to submit a job. I know that after submitting a job I can get the path and name of its script with scontrol, but users often change their scripts afterwards, and sometimes everything went wrong after that, so

Re: [slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
...
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964556 with not NULL job resources
[2020-06-30T11:46:52.740] error: select_nodes: calling _get_req_features() for JobId=964574 with not NULL job resources
[2020-06-30T11:46:52.741] error: select_nodes:

[slurm-users] fail job

2020-06-30 Thread Alberto Morillas, Angelines
Hi, we have Slurm version 18.08.6. One of my nodes is in the drain state with Reason=Kill task failed [root@2020-06-27T02:25:29]. On the node, I can see the following in slurmd.log:
[2020-06-27T01:24:26.242] task_p_slurmd_batch_request: 963771
[2020-06-27T01:24:26.242] task/affinity: job 963771 CPU input mask for

Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3

2020-06-01 Thread Alberto Morillas, Angelines
... band in the Open MPI 4.0.x release stream. Howard
On 6/1/20, 10:29 AM, "slurm-users on behalf of Alberto Morillas, Angelines" wrote: Hello Howard, I installed it with Spack: openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylauncher

Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3

2020-06-01 Thread Alberto Morillas, Angelines
Hello Angelines, Do you know how the Open MPI 4.0.3 package was configured and built? That information would be useful to help diagnose the problem. Thanks, Howard
From: slurm-users on behalf of "Alberto Mor

[slurm-users] problems with OpenMPI 4.0.3

2020-05-29 Thread Alberto Morillas, Angelines
Good morning, we have a cluster with two kinds of InfiniBand cards, ConnectX-4 and ConnectX-6. openmpi-3.1.3 works fine, but when we started using the ConnectX-6 cards we switched to openmpi-4.0.3 (which supports ConnectX-6), and the programs that have several parts, first a call to a

[slurm-users] problems with number of jobs with GrpTres

2020-05-19 Thread Alberto Morillas, Angelines
Hello, I have a problem with GrpTres. I set the limits with sacctmgr --immediate modify user where user= set GrpTres=cpu=144,node=4, but when the user submits serial jobs, for example 5 jobs, only 4 of them run, and the rest stay pending (PD) with reason=AssocGrpNodeLimit.

[slurm-users] slurm problem with GrpTres

2020-05-04 Thread Alberto Morillas, Angelines
Hello, I have a problem with GrpTres. I set the limits with sacctmgr --immediate modify user where user= set GrpTres=cpu=144,node=4, but when the user submits serial jobs, for example 5 jobs, only 4 of them run, and the rest stay pending (PD) with reason=AssocGrpNodeLimit.