[slurm-dev] Re: scch not found !!
Yes, I did it.

BLCR installation: in blcr-0.8.5 I did not find any file named scch.

tar xzf blcr-0.8.5.tar.gz
cd blcr-0.8.5
./configure --enable-multilib=no
make rpms
cd rpm/RPMS/x86_64
rpm -ivh blcr-0.8.5-1.x86_64.rpm blcr-modules_2.6.32_504.30.3.el6.x86_64-0.8.5-1.x86_64.rpm blcr-libs-0.8.5-1.x86_64.rpm blcr-devel-0.8.5-1.x86_64.rpm

SLURM installation:

rpmbuild -ta --with blcr slurm-15.08.4.tar.bz2
rpm -ivh slurm-plugins-15.08.4-1.el6.x86_64.rpm slurm-15.08.4-1.el6.x86_64.rpm slurm-devel-15.08.4-1.el6.x86_64.rpm slurm-munge-15.08.4-1.el6.x86_64.rpm slurm-perlapi-15.08.4-1.el6.x86_64.rpm slurm-sjobexit-15.08.4-1.el6.x86_64.rpm slurm-sjstat-15.08.4-1.el6.x86_64.rpm slurm-torque-15.08.4-1.el6.x86_64.rpm slurm-blcr-15.08.4-1.el6.x86_64.rpm

If I untar slurm-15.08.4.tar.bz2, I do not find any file named scch.

From: Manuel Rodríguez Pascual [mailto:manuel.rodriguez.pasc...@gmail.com]
Sent: Wednesday, 10 February 2016 12:35
To: slurm-dev
Subject: [slurm-dev] Re: scch not found !!

Hi,

You have to install BLCR first. It is quite straightforward though. You can download it from the Berkeley Lab website at
http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/berkeley-lab-checkpoint-restart-for-linux-blcr-downloads/

2016-02-10 11:21 GMT+01:00 David Roman:
> Hello,
>
> I try to use checkpoint in slurm. But I have an error: /usr/sbin/scch not found.
>
> I don't know what to do to install this file. Is it part of BLCR, SLURM, or something else?
>
> Thanks a lot for your reply.
>
> David

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040 MADRID
SPAIN
[slurm-dev] Re: scch not found !!
Hi David,

As can be read in the slurm.conf man page:

    checkpoint/blcr
        Berkeley Lab Checkpoint Restart (BLCR). NOTE: If a file is found
        at sbin/scch (relative to the Slurm installation location), it
        will be executed upon completion of the checkpoint. This can be
        a script used for managing the checkpoint files. NOTE: Slurm's
        BLCR logic only supports batch jobs.

As far as I know, if that file is not present, the system will complain and stop the checkpoint. So the solution is to create an empty shell script, in your case at /usr/sbin/scch, and it will stop complaining.

Please try it, and let me know if it doesn't work.

Best regards,
Manuel

2016-02-10 14:25 GMT+01:00 David Roman:
> Yes, I did it.
> [...]

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040 MADRID
SPAIN
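For reference, a minimal sketch of the empty scch hook suggested above. The real path is sbin/scch relative to the Slurm install prefix (here assumed to be /usr/sbin/scch, per the error message in this thread); the SCCH variable defaults to a local path only so the sketch can run without root:

```shell
# Create a stub scch so Slurm's BLCR checkpoint logic has something to run.
SCCH=${SCCH:-./scch}   # on a real node: SCCH=/usr/sbin/scch (assumption)
cat > "$SCCH" <<'EOF'
#!/bin/sh
# No-op checkpoint hook: Slurm's BLCR logic runs this after each checkpoint.
# Extend it to rotate or archive checkpoint files if needed.
exit 0
EOF
chmod 0755 "$SCCH"
```

Install it on every node that runs batch jobs with checkpoint/blcr enabled.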
[slurm-dev] GRES for both K80 GPU's
I have a couple of nodes with 4 x K80 GPUs in them (nvidia0-7). Is there a way to either request peer-to-peer GPUs, or force allocation of 2 GPUs at a time? We'd prefer the former (run when peer-to-peer is available, unless you don't care) so we can fit more users onto the machine. However, ensuring the peer-to-peer codes get the proper allocation is more important.

User 1 - needs a full K80 with peer-to-peer
User 2 - needs a single GPU
User 3 - needs a single GPU
User 4 - needs 2 full K80s

I.e.:

0,1 - User 1
2 - User 2
3 - User 3
4,5,6,7 - User 4

Or:

0,1 - User 1
2,3 - User 2
4,5 - User 3
QUEUED - User 4

I tried this gres configuration, but it didn't do what I expected.

Name=gpu File=/dev/nvidia[0-1] Count=2 CPUs=0-9
Name=gpu File=/dev/nvidia[2-3] Count=2 CPUs=0-9
Name=gpu File=/dev/nvidia[4-5] Count=2 CPUs=10-19
Name=gpu File=/dev/nvidia[6-7] Count=2 CPUs=10-19
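As a hedged sketch of the user side (the script name and job body are placeholders): requesting both GPUs of one board in a single job uses the standard `--gres` syntax, but whether the two allocated GPUs land on a peer-to-peer-capable pair depends on how gres.conf maps the devices, which is exactly the open question here:

```shell
# Write a batch script requesting 2 GPUs on one node (one full K80 board).
cat > k80_pair.sh <<'EOF'
#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH -N 1
srun nvidia-smi topo -m   # inspect whether the allocated pair is P2P-capable
EOF
# Submit with: sbatch k80_pair.sh
```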
[slurm-dev] EL6 clusters with cgroups enabled
Hi,

I'm curious which kernel you are running on your el6 clusters that have cgroups enabled in slurm. I have an issue where some workloads cause 100s-1000s of flocks relating to the memory-cleanup portion of the cgroup.

On the SchedMD slurm site, I see this mention:

* There can be a serious performance problem with memory cgroups on conventional multi-socket, multi-core nodes in kernels prior to 2.6.38 due to contention between processors for a spinlock. This problem seems to have been completely fixed in the 2.6.38 kernel.

Does anyone know what the kernel bug # was, so I can find the kernel where this is fixed? I'm thinking this is what I'm seeing; can anyone confirm? I have kernel 2.6.32-504.3.3.el6 and slurm version 15.08.4.

I'd like to hear from anyone who has seen this issue and what they did to resolve it. Upgrade to a newer kernel? If so, which? Is there a fix in the el6 2.6.32 series?

Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
[slurm-dev] Re: slurmd ownership
I suspect the slurm init script doesn't use the `status` command because it is specific to upstart, and not necessarily available in all init systems (e.g., in sysvinit). The slurmd pid file on one of our compute nodes appears to be readable by non-root users.[1] What error were you getting when trying to check the slurmd status before modifying the init script?

~jonathon

[1]:
# stat /var/run/slurmd.pid
  File: `/var/run/slurmd.pid'
  Size: 5         Blocks: 8          IO Block: 4096   regular file
Device: 1h/1d     Inode: 108568     Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2016-02-10 12:34:39.959156252 -0700
Modify: 2016-02-08 11:23:10.517122210 -0700
Change: 2016-02-08 11:23:10.517122210 -0700

> On Feb 10, 2016, at 10:56 PM, jupiter wrote:
>
> Hi Jonathon,
>
> I figured out, the problem is not the root ownership but the way /etc/init.d/slurm implements "service slurm status": it checks the pid file, which caused the permission issue. Why didn't it simply run "status slurmd", which works perfectly?
>
> I've modified the status section and it works fine now, thanks for your response.
>
> status)
>     prog="${0##*/}d"
>     status ${prog}
>     ;;
>
> On Thu, Feb 11, 2016 at 1:05 PM, Jonathon A Anderson wrote:
> slurmd must run as root because it forks and execs processes on behalf of other users using the job owner's uid.
>
> I don't understand what trouble you're having monitoring slurm with nagios. Could you give an example of what you're trying to do, what you expect it to do, and what it does instead?
>
> ~jonathon
>
> > On Feb 10, 2016, at 6:30 PM, jupiter wrote:
> >
> > Hi,
> >
> > I am running slurm on CentOS 6. One thing I've just noticed is that slurmctld is running as the user slurm but slurmd is running as root. Not quite sure why those daemons run under different owners; any inside explanation please? I further looked at both slurm daemon configurations in /etc/init.d and slurm.conf; they are identical, so how could they behave differently? Anyway, can the ownership of slurmd be changed to the user slurm?
> >
> > The problem I've got now is that I am running nagios monitoring via ssh; it can check all other application daemon statuses, but it always fails to check the slurm daemon status due to the slurmd root access restriction.
> >
> > Thanks.
[slurm-dev] Re: AllowGroups and AD
On 2016-02-10 15:12, Diego Zuccato wrote:
> Hello all.
> I think I'm doing something wrong, but I don't understand what. I'm trying to limit the users allowed to use a partition (which, coming from Torque, I think is the equivalent of a queue), but obviously I'm failing. :(
> Frontend and work nodes are all Debian machines joined to AD via Winbind (which ensures consistent UID/GID mapping, at the expense of having many groups and a bit of slowness while looking 'em up). On every node I can run 'id' and it says (redacted):
> uid=108036(diego.zuccato) gid=100013(domain_users) groups=100013(domain_users),[...],242965(str957.tecnici),[...]
> (it takes about 10s to get the complete list of groups).
> Linux ACLs work as expected (if I set a file to be readable only by Str957.tecnici, I can read it), but when I do
> scontrol update PartitionName=pp_base AllowGroups=str957.tecnici
> or even
> scontrol update PartitionName=pp_base AllowGroups=242965
> and then try to sbatch a job, I get:
> diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
> sbatch: error: Batch job submission failed: User's group not permitted to use this partition
> diego.zuccato@Str957-cluster:~$ newgrp Str957.tecnici
> diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
> sbatch: error: Batch job submission failed: User's group not permitted to use this partition
> So I don't get recognized even if I change my primary GID :(
> I've been in that group since way before installing the cluster, and I already tried rebooting everything to refresh the cache.
> Another detail that may be useful:
> diego.zuccato@Str957-cluster:~$ time getent group str957.tecnici
> str957.tecnici:x:242965:[...],diego.zuccato,[...]
> real    0m0.012s
> user    0m0.000s
> sys     0m0.000s
> Any hints?
> TIA

Hi, do you have user and group enumeration enabled in winbind? I.e., do

$ getent passwd

and

$ getent group

return nothing, or the entire user and group lists?

FWIW, slurm 16.05 will have some changes to work better in environments with enumeration disabled; see http://bugs.schedmd.com/show_bug.cgi?id=1629

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
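A quick way to run that enumeration check (a sketch; the group name comes from the original mail and will not resolve outside that domain):

```shell
# With 'winbind enum users/groups = no' in smb.conf, plain getent lists only
# local entries, not domain ones; a lookup by explicit name works either way.
echo "users visible via enumeration:  $(getent passwd | wc -l)"
echo "groups visible via enumeration: $(getent group | wc -l)"
getent group str957.tecnici || echo "str957.tecnici not resolvable here"
```

If the counts cover only local accounts but the named lookup succeeds, enumeration is disabled, which matters for how slurmctld expands AllowGroups.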
[slurm-dev] Re: EL6 clusters with cgroups enabled
On 11/02/16 06:15, Christopher B Coffey wrote:
> I'm curious which kernel you are running on your el6 clusters that
> have cgroups enabled in slurm. I have an issue where some workloads
> cause 100s-1000s of flocks to occur relating to the memory cleanup
> portion in the cgroup.

Is this kernel code, or userspace?

My understanding of the kernel developers' concerns over memory cgroups was around the extra overhead in memory allocation inside the kernel. Here's a write-up from LWN from the 2012 mm minisummit at the Kernel Summit on the issue:

https://lwn.net/Articles/516533/

Interestingly, the RHEL page mentions a memory overhead on x86-64, but not a performance issue, so whether they backported later patches to reduce the impact of memory cgroups I cannot tell right now.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

I did benchmarking a few years back when we were transitioning to RHEL6 and Slurm with memory cgroups enabled and couldn't see any significant difference in performance. Unfortunately I suspect I cleaned all that up some time ago. :-(

We use them and haven't noticed any issues yet.

All the best,
Chris
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
[slurm-dev] Re: slurmd ownership
Hi Jonathon,

I figured out, the problem is not the root ownership but the way /etc/init.d/slurm implements "service slurm status": it checks the pid file, which caused the permission issue. Why didn't it simply run "status slurmd", which works perfectly?

I've modified the status section and it works fine now, thanks for your response.

status)
    prog="${0##*/}d"
    status ${prog}
    ;;

On Thu, Feb 11, 2016 at 1:05 PM, Jonathon A Anderson <jonathon.ander...@colorado.edu> wrote:
> slurmd must run as root because it forks and execs processes on behalf of other users using the job owner's uid.
>
> I don't understand what trouble you're having monitoring slurm with nagios. Could you give an example of what you're trying to do, what you expect it to do, and what it does instead?
>
> ~jonathon
>
> > On Feb 10, 2016, at 6:30 PM, jupiter wrote:
> >
> > Hi,
> >
> > I am running slurm on CentOS 6. One thing I've just noticed is that slurmctld is running as the user slurm but slurmd is running as root. Not quite sure why those daemons run under different owners; any inside explanation please? I further looked at both slurm daemon configurations in /etc/init.d and slurm.conf; they are identical, so how could they behave differently? Anyway, can the ownership of slurmd be changed to the user slurm?
> >
> > The problem I've got now is that I am running nagios monitoring via ssh; it can check all other application daemon statuses, but it always fails to check the slurm daemon status due to the slurmd root access restriction.
> >
> > Thanks.
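A pid-file-free variant of that status check is also possible (a sketch for the nagios-over-ssh case): `pgrep` needs no read access to /var/run/slurmd.pid, so it works for an unprivileged monitoring user even when the upstart `status` command is unavailable.

```shell
check_proc() {
    # Report whether a process with this exact name is running.
    if pgrep -x "$1" > /dev/null; then
        echo "$1 is running"
    else
        echo "$1 is not running"
    fi
}
check_proc slurmd   # on a compute node this should report it as running
```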
[slurm-dev] Re: slurmd ownership
On 11/02/16 12:31, jupiter wrote:
> I am running slurm on CentOS 6. One thing I've just noticed is that the
> slurmctld is running under the user slurm but the slurmd is running
> under the root.

slurmd has to run as root as it must be able to start processes as the users whose jobs are to run.

http://slurm.schedmd.com/quickstart_admin.html

# The slurmd daemon executes on every compute node. It resembles
# a remote shell daemon to export control to Slurm. Because slurmd
# initiates and manages user jobs, it must execute as the user root.

All the best,
Chris
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
[slurm-dev] Setting up SLURM for a single multi-core node
Hello,

I'm trying to set up SLURM-15.08.1 on a single multi-core node to manage multi-threaded jobs. The machine has 16 physical cores on 2 sockets with HyperThreading enabled. I'm using the EASY scheduling algorithm with backfilling.

The goal is to fully utilize all the available cores at all times. Given a list of three jobs with requirements of 8 cores, 2 cores, and 4 cores, the expectation is that the jobs should be co-scheduled to utilize 14 of the 16 available cores. However, I can't seem to get SLURM to work as expected. SLURM runs the latter two jobs together but refuses to schedule the first job until they finish. (Is this the expected behavior of the EASY-backfilling algorithm?)

Here's the list of jobs:

$ cat job1.batch
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=8
#SBATCH --threads-per-core=1
srun /path/to/application1

$ cat job2.batch
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=2
#SBATCH --threads-per-core=1
srun /path/to/application2

$ cat job3.batch
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=4
#SBATCH --threads-per-core=1
srun /path/to/application3

Here's my SLURM config:

$ cat /path/to/slurm.conf
ControlMachine=localhost
ControlAddr=127.0.0.1
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/path/to/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/path/to/pids/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/path/to/slurmdspooldir
SlurmUser=myuserid
SlurmdUser=myuserid
StateSaveLocation=/path/to/states
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
AccountingStorageLoc=/path/to/accounting.log
AccountingStorageType=accounting_storage/filetxt
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/path/to/completion.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=localhost NodeAddr=127.0.0.1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
DebugFlags=Backfill,CPU_Bind,Priority,SelectType

I'm a SLURM newbie so I might be missing something obvious. I'd appreciate any help.

Thanks,
-Rohan
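One thing worth trying (a sketch, not a verified fix): drop the socket/core-topology flags and request plain core counts instead, which leaves the cons_res/CR_CORE selector free to pack the jobs onto whichever cores are idle. The application path is a placeholder carried over from the original scripts.

```shell
# Rewrite of job1.batch using a plain core-count request.
cat > job1_alt.batch <<'EOF'
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 8        # 8 cores anywhere on the node, no socket constraint
srun /path/to/application1
EOF
# Submit with: sbatch job1_alt.batch
```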
[slurm-dev] Re: slurm-dev Job not using cores from different nodes
Dear Benjamin,

Many thanks for your answer. The blades all have 32 cores, but I left one free since there is a VM running on each of them; don't really know if that helps but I wanted to be on the safe side :)

I submitted the job with sbatch and the following script:

#!/bin/bash
#SBATCH -n 80 # number of cores
#SBATCH -o /mnt/nfs/bio/HPC_related_material/Jobs_STDOUT_logs/slurm.%N.%j.out # STDOUT
#SBATCH -e /mnt/nfs/bio/HPC_related_material/Jobs_STDERR_logs/slurm.%N.%j.err # STDERR
perl /mnt/nfs/bio/Script_test_folder/Mira_script.pl

And the mira manifest file (don't know if you have experience with this assembler?) is written in a way that the software should use the total number of allocated cores:

project = 36
job = genome,denovo,accurate
readgroup = Illumina_Paired_End_files
autopairing
data = Stool_R1_36_NEB_Ultra.fastq Stool_R2_36_NEB_Ultra.fastq
technology = solexa
parameters = -GENERAL:number_of_threads=80 -GENERAL:mps=0 -SK:mchr=2048 -SK:mhpr=1000 -AS:sep=on -AS:nop=4 -NW:cmrnl=warn -OUT:rtd=on -OUT:orc=off -OUT:orm=off -OUT:ora=off -OUT:ort=off -NW:cnfs=warn

Do you see anything here that could be the cause of the issue?

Best regards,
Pierre

2016-02-04 19:41 GMT+01:00 Benjamin Redling:
>
> Can you post how you submitted the job?
> Mira on 60 cores needs MPI in your case. Multi-threading works without it.
>
> BTW, your config says 31 CPUs. Generated without incrementing the index, or intended?
>
> On 4 February 2016 18:02:15 CET, Pierre Schneeberger wrote:
> >Hi there,
> >
> >I'm setting up a small cluster composed of 4 blades with 32 (physical)
> >cores and 750 Gb RAM each (so a total of 128 cores and approx 3 Tb
> >RAM). A CentOS 7 VM is running on each blade.
> >The slurm controller service is up and running on one of the blades, and
> >the daemon service has been installed on each of the four blades (up and
> >running as well).
> >
> >A few days ago, I submitted a job using the MIRA assembler
> >(multithreaded) on 60 cores and it worked well, using all the resources
> >I allocated to the job. At that point, only 2 blades (including the one
> >with the controller) were running and the job was completed successfully
> >using 60 cores when needed.
> >
> >The problem appeared when I added the 2 last blades, and it seems that
> >it doesn't matter how much resources (cores) I allocate to a job, it now
> >runs on a maximum of 32 cores (the number of physical cores per node).
> >I tried it with 60, 90 and 120 cores but MIRA, according to the system
> >monitor from CentOS, seems to use only a maximum of 32 cores (all cores
> >from one node but none of the others that were allocated). Is it
> >possible that there is a communication issue between the nodes?
> >(although all seem available when using the sinfo command).
> >
> >I tried to restart the different services (controller/slaves) but it
> >doesn't seem to help.
> >
> >I would be grateful if someone could give me a hint on how to solve
> >this issue,
> >
> >Many thanks in advance,
> >Pierre
> >
> >Here is the *slurm.conf* information:
> >
> ># slurm.conf file generated by configurator easy.html.
> ># Put this file on all nodes of your cluster.
> ># See the slurm.conf man page for more information.
> >#
> >ControlMachine=hpc-srvbio-03
> >ControlAddr=192.168.12.12
> >#
> >#MailProg=/bin/mail
> >MpiDefault=none
> >#MpiParams=ports=#-#
> >ProctrackType=proctrack/pgid
> >ReturnToService=1
> >SlurmctldPidFile=/var/run/slurmctld.pid
> >#SlurmctldPort=6817
> >SlurmdPidFile=/var/run/slurmd.pid
> >#SlurmdPort=6818
> >SlurmdSpoolDir=/var/spool/slurmd
> >SlurmUser=root
> >#SlurmdUser=root
> >StateSaveLocation=/var/spool/slurmctld
> >SwitchType=switch/none
> >TaskPlugin=task/none
> >#
> ># TIMERS
> >#KillWait=30
> >#MinJobAge=300
> >#SlurmctldTimeout=120
> >#SlurmdTimeout=300
> >#
> ># SCHEDULING
> >FastSchedule=1
> >SchedulerType=sched/backfill
> >#SchedulerPort=7321
> >SelectType=select/linear
> >#
> ># LOGGING AND ACCOUNTING
> >AccountingStorageType=accounting_storage/filetxt
> >ClusterName=cluster
> >#JobAcctGatherFrequency=30
> >JobAcctGatherType=jobacct_gather/none
> >#SlurmctldDebug=3
> >#SlurmctldLogFile=
> >#SlurmdDebug=3
> >#SlurmdLogFile=
> >#
> ># COMPUTE NODES
> >#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
> >PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2] Default=YES MaxTime=INFINITE State=UP
> >NodeName=DEFAULT CPUs=31 RealMemory=75 TmpDisk=36758
> >NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
> >NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
> >NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
> >NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10

--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.HTML
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
[slurm-dev] scch not found !!
Hello,

I try to use checkpoint in slurm. But I have an error: /usr/sbin/scch not found.

I don't know what to do to install this file. Is it part of BLCR, SLURM, or something else?

Thanks a lot for your reply.

David
[slurm-dev] Re: scch not found !!
Hi,

You have to install BLCR first. It is quite straightforward though. You can download it from the Berkeley Lab website at
http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/berkeley-lab-checkpoint-restart-for-linux-blcr-downloads/

2016-02-10 11:21 GMT+01:00 David Roman:
> Hello,
>
> I try to use checkpoint in slurm. But I have an error: /usr/sbin/scch not found.
>
> I don't know what to do to install this file. Is it part of BLCR, SLURM, or something else?
>
> Thanks a lot for your reply.
>
> David

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040 MADRID
SPAIN
[slurm-dev] AllowGroups and AD
Hello all.

I think I'm doing something wrong, but I don't understand what. I'm trying to limit the users allowed to use a partition (which, coming from Torque, I think is the equivalent of a queue), but obviously I'm failing. :(

Frontend and work nodes are all Debian machines joined to AD via Winbind (which ensures consistent UID/GID mapping, at the expense of having many groups and a bit of slowness while looking 'em up). On every node I can run 'id' and it says (redacted):

uid=108036(diego.zuccato) gid=100013(domain_users) groups=100013(domain_users),[...],242965(str957.tecnici),[...]

(it takes about 10s to get the complete list of groups).

Linux ACLs work as expected (if I set a file to be readable only by Str957.tecnici, I can read it), but when I do

scontrol update PartitionName=pp_base AllowGroups=str957.tecnici

or even

scontrol update PartitionName=pp_base AllowGroups=242965

and then try to sbatch a job, I get:

diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted to use this partition
diego.zuccato@Str957-cluster:~$ newgrp Str957.tecnici
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted to use this partition

So I don't get recognized even if I change my primary GID :( I've been in that group since way before installing the cluster, and I already tried rebooting everything to refresh the cache.

Another detail that may be useful:

diego.zuccato@Str957-cluster:~$ time getent group str957.tecnici
str957.tecnici:x:242965:[...],diego.zuccato,[...]
real    0m0.012s
user    0m0.000s
sys     0m0.000s

Any hints?
TIA

--
Diego Zuccato
Servizi Informatici
Dip. di Fisica e Astronomia (DIFA) - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
mail: diego.zucc...@unibo.it
[slurm-dev] Re: slurmd ownership
slurmd must run as root because it forks and execs processes on behalf of other users using the job owner's uid.

I don't understand what trouble you're having monitoring slurm with nagios. Could you give an example of what you're trying to do, what you expect it to do, and what it does instead?

~jonathon

> On Feb 10, 2016, at 6:30 PM, jupiter wrote:
>
> Hi,
>
> I am running slurm on CentOS 6. One thing I've just noticed is that slurmctld is running as the user slurm but slurmd is running as root. Not quite sure why those daemons run under different owners; any inside explanation please? I further looked at both slurm daemon configurations in /etc/init.d and slurm.conf; they are identical, so how could they behave differently? Anyway, can the ownership of slurmd be changed to the user slurm?
>
> The problem I've got now is that I am running nagios monitoring via ssh; it can check all other application daemon statuses, but it always fails to check the slurm daemon status due to the slurmd root access restriction.
>
> Thanks.
[slurm-dev] slurmd ownership
Hi,

I am running slurm on CentOS 6. One thing I've just noticed is that slurmctld is running as the user slurm but slurmd is running as root. Not quite sure why those daemons run under different owners; any inside explanation please? I further looked at both slurm daemon configurations in /etc/init.d and slurm.conf; they are identical, so how could they behave differently? Anyway, can the ownership of slurmd be changed to the user slurm?

The problem I've got now is that I am running nagios monitoring via ssh; it can check all other application daemon statuses, but it always fails to check the slurm daemon status due to the slurmd root access restriction.

Thanks.