[slurm-dev] Re: scch not found !!

2016-02-10 Thread David Roman
Yes, I did it.

BLCR installation: in blcr-0.8.5 I did not find any file named scch.
tar xzf blcr-0.8.5.tar.gz
cd blcr-0.8.5
./configure --enable-multilib=no
make rpms
cd rpm/RPMS/x86_64
rpm -ivh blcr-0.8.5-1.x86_64.rpm 
blcr-modules_2.6.32_504.30.3.el6.x86_64-0.8.5-1.x86_64.rpm 
blcr-libs-0.8.5-1.x86_64.rpm blcr-devel-0.8.5-1.x86_64.rpm


SLURM installation

rpmbuild -ta --with blcr slurm-15.08.4.tar.bz2

rpm -ivh slurm-plugins-15.08.4-1.el6.x86_64.rpm slurm-15.08.4-1.el6.x86_64.rpm 
slurm-devel-15.08.4-1.el6.x86_64.rpm slurm-munge-15.08.4-1.el6.x86_64.rpm 
slurm-perlapi-15.08.4-1.el6.x86_64.rpm slurm-sjobexit-15.08.4-1.el6.x86_64.rpm 
slurm-sjstat-15.08.4-1.el6.x86_64.rpm slurm-torque-15.08.4-1.el6.x86_64.rpm 
slurm-blcr-15.08.4-1.el6.x86_64.rpm

If I untar slurm-15.08.4.tar.bz2, I do not find any file named scch either.
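
For reference, a quick way to confirm that scch really is absent from both source trees (a sketch):

tar tjf slurm-15.08.4.tar.bz2 | grep -i scch   # no output: scch is not shipped with Slurm
tar tzf blcr-0.8.5.tar.gz | grep -i scch       # likewise for BLCR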


From: Manuel Rodríguez Pascual [mailto:manuel.rodriguez.pasc...@gmail.com]
Sent: Wednesday, 10 February 2016 12:35
To: slurm-dev
Subject: [slurm-dev] Re: scch not found !!

hi,

You have to install BLCR first. It is quite straightforward, though. You can 
download it from the Berkeley Lab website at 
http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/berkeley-lab-checkpoint-restart-for-linux-blcr-downloads/




2016-02-10 11:21 GMT+01:00 David Roman:
Hello,

I am trying to use checkpointing in Slurm, but I get an error that 
/usr/sbin/scch is not found.
I don't know what to do to install this file. Is it part of BLCR, SLURM, or 
something else?

Thanks a lot for your reply.

David



--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN


[slurm-dev] Re: scch not found !!

2016-02-10 Thread Manuel Rodríguez Pascual

Hi David,

As can be read in man of slurm.conf,

checkpoint/blcr   Berkeley Lab Checkpoint Restart (BLCR).  NOTE: If a file is
found at sbin/scch (relative to the Slurm installation location), it will be
executed upon completion of the checkpoint.  This can be a script used for
managing the checkpoint files.  NOTE: Slurm's BLCR logic only supports batch
jobs.


As far as I know, if that file is not present, the system will
complain and stop the checkpoint. So the solution is to create an
empty shell script, in your case in /usr/sbin/scch, and it will stop
complaining.
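
For example, a minimal placeholder could look like the following sketch (it
assumes the Slurm installation uses /usr/sbin, matching the path in the error
message; run as root):

cat > /usr/sbin/scch <<'EOF'
#!/bin/sh
# Placeholder scch hook: Slurm runs this after each BLCR checkpoint completes.
# Commands to manage or rotate checkpoint files can be added here later.
exit 0
EOF
chmod 755 /usr/sbin/scch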

Please try it, and let me know if it doesn't work.

Best regards,


Manuel



2016-02-10 14:25 GMT+01:00 David Roman:
> Yes, I did it.
>
>
>
> BLCR installation: in blcr-0.8.5 I did not find any file named scch.
>
> tar xzf blcr-0.8.5.tar.gz
>
> cd blcr-0.8.5
>
> ./configure --enable-multilib=no
>
> make rpms
>
> cd rpm/RPMS/x86_64
>
> rpm -ivh blcr-0.8.5-1.x86_64.rpm
> blcr-modules_2.6.32_504.30.3.el6.x86_64-0.8.5-1.x86_64.rpm
> blcr-libs-0.8.5-1.x86_64.rpm blcr-devel-0.8.5-1.x86_64.rpm
>
>
>
>
>
> SLURM installation
>
> rpmbuild -ta --with blcr slurm-15.08.4.tar.bz2
>
> rpm -ivh slurm-plugins-15.08.4-1.el6.x86_64.rpm
> slurm-15.08.4-1.el6.x86_64.rpm slurm-devel-15.08.4-1.el6.x86_64.rpm
> slurm-munge-15.08.4-1.el6.x86_64.rpm slurm-perlapi-15.08.4-1.el6.x86_64.rpm
> slurm-sjobexit-15.08.4-1.el6.x86_64.rpm
> slurm-sjstat-15.08.4-1.el6.x86_64.rpm slurm-torque-15.08.4-1.el6.x86_64.rpm
> slurm-blcr-15.08.4-1.el6.x86_64.rpm
>
>
>
> If I untar slurm-15.08.4.tar.bz2, I do not find any file named scch either.
>
>
>
>
>
> From: Manuel Rodríguez Pascual [mailto:manuel.rodriguez.pasc...@gmail.com]
> Sent: Wednesday, 10 February 2016 12:35
> To: slurm-dev
> Subject: [slurm-dev] Re: scch not found !!
>
>
>
> hi,
>
>
>
> You have to install BLCR first. It is quite straightforward, though. You can
> download it from the Berkeley Lab website at
> http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/berkeley-lab-checkpoint-restart-for-linux-blcr-downloads/
>
> 2016-02-10 11:21 GMT+01:00 David Roman:
>
> Hello,
>
>
>
> I am trying to use checkpointing in Slurm, but I get an error that
> /usr/sbin/scch is not found.
>
> I don’t know what to do to install this file. Is it part of BLCR, SLURM,
> or something else?
>
>
>
> Thanks a lot for your reply.
>
>
>
> David
>
>
>
>
>
> --
>
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN



-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN

[slurm-dev] GRES for both K80 GPU's

2016-02-10 Thread Michael Senizaiz
I have a couple of nodes with 4x K80 GPUs in them (nvidia0-7).

Is there a way to either request peer-to-peer GPUs, or force allocation of
2 GPUs at a time?  We'd prefer the former (run when peer-to-peer is
available, unless you don't care) so we can fit more users onto the
machine.  However, ensuring that the peer-to-peer codes get the proper
allocation is more important.


User 1 - needs a full K80 with peer-to-peer
User 2 - needs a single GPU
User 3 - needs a single GPU
User 4 - Needs 2 full K80

I.e.,
0,1 - User 1
2- User 2
3- User 3
4,5,6,7 - User 4

Or

0,1 - User 1
2,3  - User 2
4,5   - User 3
QUEUED - User 4

I tried this gres configuration, but it didn't do what I expected.

Name=gpu File=/dev/nvidia[0-1] Count=2 CPUs=0-9
Name=gpu File=/dev/nvidia[2-3] Count=2 CPUs=0-9
Name=gpu File=/dev/nvidia[4-5] Count=2 CPUs=10-19
Name=gpu File=/dev/nvidia[6-7] Count=2 CPUs=10-19
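
For reference, a peer-to-peer job here would request both GPUs of one board
with something like this sketch (standard sbatch GRES syntax; the application
path is hypothetical, and note this only guarantees two GPUs on the node, not
necessarily both halves of the same K80, so it helps only if the allocation
order keeps board pairs together):

#!/bin/bash
#SBATCH --nodes=1         # GRES counts are per node
#SBATCH --gres=gpu:2      # two GPUs, ideally both halves of one K80
srun /path/to/p2p_application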


[slurm-dev] EL6 clusters with cgroups enabled

2016-02-10 Thread Christopher B Coffey
Hi,

I’m curious which kernel you are running on your el6 clusters that have cgroups 
enabled in Slurm.  I have an issue where some workloads cause hundreds to 
thousands of flocks to occur, related to the memory-cleanup portion of the 
cgroup.  On the SchedMD Slurm site, I see the mention of this:

* There can be a serious performance problem with memory cgroups on 
conventional multi-socket, multi-core nodes in kernels prior to 2.6.38 due to 
contention between processors for a spinlock. This problem seems to have been 
completely fixed in the 2.6.38 kernel.

Anyone know what the kernel bug # was so I can find the kernel where this is 
fixed?

I’m thinking this is what I’m seeing; can anyone confirm?  I have kernel 
2.6.32-504.3.3.el6 and Slurm version 15.08.4.
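
For reference, a few quick checks on a node to see what the memory controller
is doing (a sketch):

uname -r                    # running kernel version
grep cgroup /proc/mounts    # where (and whether) cgroups are mounted
grep memory /proc/cgroups   # is the memory controller enabled?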


I’d like to hear who has seen this issue and what they did to resolve it.  
Upgrade to a newer kernel?  If so, which?  Is there a fix in the el6 2.6.32 
series?  Thanks!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167






[slurm-dev] Re: slurmd ownership

2016-02-10 Thread Jonathon A Anderson
I suspect the slurm init script doesn’t use the `status` command because it is 
specific to upstart, and not necessarily available in all init systems (e.g., 
in sysvinit).

The slurmd pid file on one of our compute nodes appears to be readable by 
non-root users.[1] What error were you getting when trying to check the 
slurmd status before modifying the init script?

~jonathon


[1]: # stat /var/run/slurmd.pid 
  File: `/var/run/slurmd.pid'
  Size: 5   Blocks: 8  IO Block: 4096   regular file
Device: 1h/1d   Inode: 108568  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2016-02-10 12:34:39.959156252 -0700
Modify: 2016-02-08 11:23:10.517122210 -0700
Change: 2016-02-08 11:23:10.517122210 -0700


> On Feb 10, 2016, at 10:56 PM, jupiter  wrote:
> 
> Hi Jonathon,
> 
> I figured out that the problem is not root ownership but the way 
> /etc/init.d/slurm implements "service slurm status": it checks the pid 
> file, which caused the permission issue. Why doesn't it simply run "status 
> slurmd", which works perfectly?
> 
> I've modified the status case and it works fine now, thanks for your response.
> 
> status)
> prog="${0##*/}d"
> status ${prog}
> ;;
> 
> 
> 
> 
> On Thu, Feb 11, 2016 at 1:05 PM, Jonathon A Anderson 
>  wrote:
> slurmd must run as root because it forks and execs processes on behalf of 
> other users using the job owner’s uid.
> 
> I don’t understand what trouble you’re having monitoring Slurm with nagios. 
> Could you give an example of what you’re trying to do, what you expect it to 
> do, and what it does instead?
> 
> ~jonathon
> 
> 
> > On Feb 10, 2016, at 6:30 PM, jupiter  wrote:
> >
> > Hi,
> >
> > I am running Slurm on CentOS 6. One thing I've just noticed is that 
> > slurmctld is running under the user slurm but slurmd is running under 
> > root. Not quite sure why these daemons run under different owners; any 
> > insight, please? I further looked at both Slurm daemon configurations in 
> > /etc/init.d and slurm.conf; they are identical, so how could they behave 
> > differently? Anyway, can the ownership of slurmd be changed to the user 
> > slurm?
> >
> > The problem I've got now is that I am running nagios monitoring via ssh; it 
> > can check all other application daemon statuses, but it always fails to 
> > check the slurm daemon status due to the slurmd root access restriction.
> >
> > Thanks.
> >
> >
> 
> 



[slurm-dev] Re: AllowGroups and AD

2016-02-10 Thread Janne Blomqvist


On 2016-02-10 15:12, Diego Zuccato wrote:


Hello all.

I think I'm doing something wrong, but I don't understand what.

I'm trying to limit users allowed to use a partition (that, coming from
Torque, I think is the equivalent of a queue), but obviously I'm failing. :(

Frontend and work nodes are all Debians joined to AD via Winbind (that
ensures consistent UID/GID mapping, at the expense of having many groups
and a bit of slowness while looking 'em up).
On every node I can run 'id' and it says (redacted):
uid=108036(diego.zuccato) gid=100013(domain_users)
gruppi=100013(domain_users),[...],242965(str957.tecnici),[...]

(it takes about 10s to get the complete list of groups).

Linux ACLs work as expected (if I set a file to be readable only by
Str957.tecnici I can read it), but when I do
scontrol update PartitionName=pp_base AllowGroups=str957.tecnici
or even
scontrol update PartitionName=pp_base AllowGroups=242965

when I try to sbatch a job I get:
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted
to use this partition
diego.zuccato@Str957-cluster:~$ newgrp Str957.tecnici
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted
to use this partition

So I won't get recognized even if I change my primary GID :(

I've been in that group since way before installing the cluster, and I
already tried rebooting everything to refresh the cache.

Another detail that can be useful:
diego.zuccato@Str957-cluster:~$ time getent group str957.tecnici
str957.tecnici:x:242965:[...],diego.zuccato,[...]

real    0m0.012s
user    0m0.000s
sys     0m0.000s

Any hints?

TIA


Hi,

do you have user and group enumeration enabled in winbind? I.e. does

$ getent passwd

and

$ getent group

return nothing, or the entire user and group lists?

FWIW, slurm 16.05 will have some changes to work better in environments with 
enumeration disabled, see http://bugs.schedmd.com/show_bug.cgi?id=1629

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: EL6 clusters with cgroups enabled

2016-02-10 Thread Christopher Samuel

On 11/02/16 06:15, Christopher B Coffey wrote:

> I’m curious which kernel you are running on your el6 clusters that
> have cgroups enabled in Slurm.  I have an issue where some workloads
> cause hundreds to thousands of flocks to occur, related to the memory
> cleanup portion of the cgroup.

Is this kernel code, or userspace?

My understanding of the kernel developers' concerns over memory cgroups
was around the extra overhead in memory allocation inside the kernel.

Here's a write up from LWN from the 2012 mm minisummit at the Kernel
Summit on the issue:

https://lwn.net/Articles/516533/

Interestingly, the RHEL page mentions a memory overhead on x86-64, but
not a performance issue, so whether they backported later patches to
reduce the impact of memory cgroups I cannot tell right now.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

I did benchmarking a few years back when we were transitioning to
RHEL6 and Slurm with memory cgroups enabled and couldn't see any
significant difference in performance.  Unfortunately I suspect I
cleaned all that up some time ago. :-(

We use them and haven't noticed any issues yet.

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: slurmd ownership

2016-02-10 Thread jupiter
Hi Jonathon,

I figured out that the problem is not root ownership but the way
/etc/init.d/slurm implements "service slurm status": it checks the pid
file, which caused the permission issue. Why doesn't it simply run "status
slurmd", which works perfectly?

I've modified the status case and it works fine now, thanks for your response.

status)
prog="${0##*/}d"
status ${prog}
;;
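
As an alternative, a pid-file-free check that Nagios can run without root
might look like this sketch (the daemon name slurmd is assumed):

#!/bin/sh
# check_slurmd.sh - minimal Nagios-style check; avoids the root-owned pid file
if pgrep -x slurmd > /dev/null; then
    echo "OK - slurmd is running"
    exit 0
else
    echo "CRITICAL - slurmd is not running"
    exit 2
fi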




On Thu, Feb 11, 2016 at 1:05 PM, Jonathon A Anderson <
jonathon.ander...@colorado.edu> wrote:

> slurmd must run as root because it forks and execs processes on behalf of
> other users using the job owner’s uid.
>
> I don’t understand what trouble you’re having monitoring Slurm with
> nagios. Could you give an example of what you’re trying to do, what you
> expect it to do, and what it does instead?
>
> ~jonathon
>
>
> > On Feb 10, 2016, at 6:30 PM, jupiter  wrote:
> >
> > Hi,
> >
> > I am running Slurm on CentOS 6. One thing I've just noticed is that
> slurmctld is running under the user slurm but slurmd is running under
> root. Not quite sure why these daemons run under different owners; any
> insight, please? I further looked at both Slurm daemon configurations in
> /etc/init.d and slurm.conf; they are identical, so how could they behave
> differently? Anyway, can the ownership of slurmd be changed to the user
> slurm?
> >
> > The problem I've got now is that I am running nagios monitoring via ssh; it
> can check all other application daemon statuses, but it always fails to
> check the slurm daemon status due to the slurmd root access restriction.
> >
> > Thanks.
> >
> >
>
>


[slurm-dev] Re: slurmd ownership

2016-02-10 Thread Christopher Samuel

On 11/02/16 12:31, jupiter wrote:

> I am running slurm on CentOS 6. One thing I've just noticed is that the
> slurmctld is running under the user slurm but the slurmd is running
> under the root.

slurmd has to run as root as it must be able to start processes as the
users whose jobs are to run.

http://slurm.schedmd.com/quickstart_admin.html

# The slurmd daemon executes on every compute node. It resembles
# a remote shell daemon to export control to Slurm. Because slurmd
# initiates and manages user jobs, it must execute as the user root.

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Setting up SLURM for a single multi-core node

2016-02-10 Thread Rohan Garg

Hello,

I'm trying to set up SLURM-15.08.1 on a single multi-core node to
manage multi-threaded jobs. The machine has 16 physical cores
on 2 sockets with HyperThreading enabled. I'm using the EASY
scheduling algorithm with backfilling. The goal is to fully utilize all
the available cores at all times.

Given a list of three jobs with requirements of 8 cores, 2 cores,
and 4 cores, the expectation is that the jobs should be co-scheduled
to utilize 14 of the 16 available cores.  However, I can't seem to
get SLURM to work as expected. SLURM runs the latter two jobs
together but refuses to schedule the first job until they finish.
(Is this the expected behavior of the EASY-backfilling algorithm?)

Here's the list of jobs:

  $ cat job1.batch

#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=8
#SBATCH --threads-per-core=1
srun /path/to/application1
  
  $ cat job2.batch
  
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=2
#SBATCH --threads-per-core=1
srun /path/to/application2
  
  $ cat job3.batch
  
#!/bin/bash
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=4
#SBATCH --threads-per-core=1
srun /path/to/application3
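
One variant worth testing is to drop the explicit socket geometry and request
plain cores, which gives the selector more placement freedom (a sketch for
job1; whether it changes the packing depends on the select/cons_res
configuration):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8   # eight cores, no socket constraint
srun /path/to/application1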

Here's my SLURM config:

  $ cat /path/to/slurm.conf

ControlMachine=localhost
ControlAddr=127.0.0.1
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/path/to/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/path/to/pids/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/path/to/slurmdspooldir
SlurmUser=myuserid
SlurmdUser=myuserid
StateSaveLocation=/path/to/states
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
AccountingStorageLoc=/path/to/accounting.log
AccountingStorageType=accounting_storage/filetxt
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/path/to/completion.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmdDebug=3
NodeName=localhost NodeAddr=127.0.0.1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
DebugFlags=Backfill,CPU_Bind,Priority,SelectType
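
To see how the controller actually interpreted the node and the pending jobs,
something like this sketch can help (standard scontrol/squeue options):

scontrol show node localhost     # verify Sockets/Cores/Threads as configured
squeue -o "%.6i %.8T %.4C %R"    # job id, state, CPU count, and pending reason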

I'm a SLURM newbie so I might be missing something obvious. I'd
appreciate any help.

Thanks,
-Rohan


[slurm-dev] Re: slurm-dev Job not using cores from different nodes

2016-02-10 Thread Pierre Schneeberger
Dear Benjamin,

Many thanks for your answer.
The blades all have 32 cores, but I left one free since there is a VM
running on each of them; I don't really know if that helps, but I wanted to
be on the safe side. :)

I submitted the job with sbatch and the following script:

#!/bin/bash
#SBATCH -n 80 # number of cores
#SBATCH -o /mnt/nfs/bio/HPC_related_material/Jobs_STDOUT_logs/slurm.%N.%j.out # STDOUT
#SBATCH -e /mnt/nfs/bio/HPC_related_material/Jobs_STDERR_logs/slurm.%N.%j.err # STDERR
perl /mnt/nfs/bio/Script_test_folder/Mira_script.pl

And the MIRA manifest file (I don't know if you have experience with this
assembler?) is written so that the software should use the total number of
allocated cores:

project = 36
job = genome,denovo,accurate
readgroup = Illumina_Paired_End_files
autopairing
data = Stool_R1_36_NEB_Ultra.fastq Stool_R2_36_NEB_Ultra.fastq
technology = solexa
parameters = -GENERAL:number_of_threads=80 -GENERAL:mps=0 -SK:mchr=2048
-SK:mhpr=1000 -AS:sep=on -AS:nop=4 -NW:cmrnl=warn -OUT:rtd=on -OUT:orc=off
-OUT:orm=off -OUT:ora=off -OUT:ort=off -NW:cnfs=warn

Do you see anything here that could be the cause of the issue?
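
(In case it helps: since MIRA without MPI is multi-threaded only, as Benjamin
notes below, a run pinned to a single node would look like this sketch, where
31 matches the CPUs per node in the config:)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=31   # all configured cores of one node
perl /mnt/nfs/bio/Script_test_folder/Mira_script.pl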


Best regards,
Pierre


2016-02-04 19:41 GMT+01:00 Benjamin Redling :

>
> Can you post how you submitted the job?
> MIRA on 60 cores needs MPI in your case. Multithreading works without it.
>
> BTW, your config says 31 CPUs. Generated without incrementing the index, or intended?
>
> Am 4. Februar 2016 18:02:15 MEZ, schrieb Pierre Schneeberger <
> pierre.schneeber...@gmail.com>:
> >Hi there,
> >
> >I'm setting up a small cluster composed of 4 blades with 32 (physical)
> >cores and 750 GB RAM each (so a total of 128 cores and approx. 3 TB RAM).
> >A CentOS 7 VM is running on each blade.
> >The Slurm controller service is up and running on one of the blades, and
> >the daemon service has been installed on each of the four blades (up and
> >running as well).
> >
> >A few days ago, I submitted a job using the MIRA assembler (multithreaded)
> >on 60 cores and it worked well, using all the resources I allocated to the
> >job. At that point, only 2 blades (including the one with the controller)
> >were running and the job was completed successfully using 60 cores when
> >needed.
> >
> >The problem appeared when I added the last 2 blades, and it seems that it
> >doesn't matter how many resources (cores) I allocate to a job; it now runs
> >on a maximum of 32 cores (the number of physical cores per node).
> >I tried it with 60, 90 and 120 cores, but MIRA, according to the system
> >monitor from CentOS, seems to use only a maximum of 32 cores (all cores
> >from one node but none of the others that were allocated). Is it possible
> >that there is a communication issue between the nodes? (Although all nodes
> >seem available when using the sinfo command.)
> >
> >I tried to restart the different services (controller/slaves) but it
> >doesn't seem to help.
> >
> >I would be grateful if someone could give me a hint on how to solve
> >this
> >issue,
> >
> >Many thanks in advance,
> >Pierre
> >
> >Here is the *slurm.conf* information:
> >
> ># slurm.conf file generated by configurator easy.html.
> ># Put this file on all nodes of your cluster.
> ># See the slurm.conf man page for more information.
> >#
> >ControlMachine=hpc-srvbio-03
> >ControlAddr=192.168.12.12
> >#
> >#MailProg=/bin/mail
> >MpiDefault=none
> >#MpiParams=ports=#-#
> >ProctrackType=proctrack/pgid
> >ReturnToService=1
> >SlurmctldPidFile=/var/run/slurmctld.pid
> >#SlurmctldPort=6817
> >SlurmdPidFile=/var/run/slurmd.pid
> >#SlurmdPort=6818
> >SlurmdSpoolDir=/var/spool/slurmd
> >SlurmUser=root
> >#SlurmdUser=root
> >StateSaveLocation=/var/spool/slurmctld
> >SwitchType=switch/none
> >TaskPlugin=task/none
> >#
> >#
> ># TIMERS
> >#KillWait=30
> >#MinJobAge=300
> >#SlurmctldTimeout=120
> >#SlurmdTimeout=300
> >#
> >#
> ># SCHEDULING
> >FastSchedule=1
> >SchedulerType=sched/backfill
> >#SchedulerPort=7321
> >SelectType=select/linear
> >#
> >#
> ># LOGGING AND ACCOUNTING
> >AccountingStorageType=accounting_storage/filetxt
> >ClusterName=cluster
> >#JobAcctGatherFrequency=30
> >JobAcctGatherType=jobacct_gather/none
> >#SlurmctldDebug=3
> >#SlurmctldLogFile=
> >#SlurmdDebug=3
> >#SlurmdLogFile=
> >#
> >#
> ># COMPUTE NODES
> >#NodeName=Nodes[1-4] CPUs=31 State=UNKNOWN
> >PartitionName=HPC_test Nodes=hpc-srvbio-0[3-4],HPC-SRVBIO-0[1-2]
> >Default=YES MaxTime=INFINITE State=UP
> >NodeName=DEFAULT CPUs=31 RealMemory=75 TmpDisk=36758
> >NodeName=hpc-srvbio-03 NodeAddr=192.168.12.12
> >NodeName=hpc-srvbio-04 NodeAddr=192.168.12.13
> >NodeName=HPC-SRVBIO-02 NodeAddr=192.168.12.11
> >NodeName=HPC-SRVBIO-01 NodeAddr=192.168.12.10
>
> --
> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.HTML
> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
>


[slurm-dev] scch not found !!

2016-02-10 Thread David Roman
Hello,

I am trying to use checkpointing in Slurm, but I get an error that 
/usr/sbin/scch is not found.
I don't know what to do to install this file. Is it part of BLCR, SLURM, or 
something else?

Thanks a lot for your reply.

David


[slurm-dev] Re: scch not found !!

2016-02-10 Thread Manuel Rodríguez Pascual
hi,

You have to install BLCR first. It is quite straightforward, though. You can
download it from the Berkeley Lab website at
http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/berkeley-lab-checkpoint-restart-for-linux-blcr-downloads/
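
Once installed, a quick sanity check that BLCR is actually usable (a sketch;
module and tool names as shipped with BLCR 0.8.x):

lsmod | grep blcr      # expect the blcr and blcr_imports kernel modules
cr_checkpoint --help   # the BLCR user-space tools should be on PATH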




2016-02-10 11:21 GMT+01:00 David Roman:

> Hello,
>
>
>
> I am trying to use checkpointing in Slurm, but I get an error that
> /usr/sbin/scch is not found.
>
> I don’t know what to do to install this file. Is it part of BLCR,
> SLURM, or something else?
>
>
>
> Thanks a lot for your reply.
>
>
>
> David
>



-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN


[slurm-dev] AllowGroups and AD

2016-02-10 Thread Diego Zuccato

Hello all.

I think I'm doing something wrong, but I don't understand what.

I'm trying to limit users allowed to use a partition (that, coming from
Torque, I think is the equivalent of a queue), but obviously I'm failing. :(

Frontend and work nodes are all Debians joined to AD via Winbind (that
ensures consistent UID/GID mapping, at the expense of having many groups
and a bit of slowness while looking 'em up).
On every node I can run 'id' and it says (redacted):
uid=108036(diego.zuccato) gid=100013(domain_users)
gruppi=100013(domain_users),[...],242965(str957.tecnici),[...]

(it takes about 10s to get the complete list of groups).

Linux ACLs work as expected (if I set a file to be readable only by
Str957.tecnici I can read it), but when I do
scontrol update PartitionName=pp_base AllowGroups=str957.tecnici
or even
scontrol update PartitionName=pp_base AllowGroups=242965

when I try to sbatch a job I get:
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted
to use this partition
diego.zuccato@Str957-cluster:~$ newgrp Str957.tecnici
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted
to use this partition

So I won't get recognized even if I change my primary GID :(

I've been in that group since way before installing the cluster, and I
already tried rebooting everything to refresh the cache.

Another detail that can be useful:
diego.zuccato@Str957-cluster:~$ time getent group str957.tecnici
str957.tecnici:x:242965:[...],diego.zuccato,[...]

real    0m0.012s
user    0m0.000s
sys     0m0.000s
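
Two more checks, in case they help (a sketch; standard scontrol and id usage):

scontrol show partition pp_base | grep -i allow   # what Slurm actually stored
id -Gn diego.zuccato                              # group names as this node resolves them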

Any hints?

TIA

-- 
Diego Zuccato
Servizi Informatici
Dip. di Fisica e Astronomia (DIFA) - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
mail: diego.zucc...@unibo.it


[slurm-dev] Re: slurmd ownership

2016-02-10 Thread Jonathon A Anderson
slurmd must run as root because it forks and execs processes on behalf of other 
users using the job owner’s uid.

I don’t understand what trouble you’re having monitoring Slurm with nagios. 
Could you give an example of what you’re trying to do, what you expect it to 
do, and what it does instead?

~jonathon


> On Feb 10, 2016, at 6:30 PM, jupiter  wrote:
> 
> Hi,
> 
> I am running Slurm on CentOS 6. One thing I've just noticed is that 
> slurmctld is running under the user slurm but slurmd is running under 
> root. Not quite sure why these daemons run under different owners; any 
> insight, please? I further looked at both Slurm daemon configurations in 
> /etc/init.d and slurm.conf; they are identical, so how could they behave 
> differently? Anyway, can the ownership of slurmd be changed to the user 
> slurm? 
> 
> The problem I've got now is that I am running nagios monitoring via ssh; it 
> can check all other application daemon statuses, but it always fails to 
> check the slurm daemon status due to the slurmd root access restriction. 
> 
> Thanks.
> 
> 



[slurm-dev] slurmd ownership

2016-02-10 Thread jupiter
Hi,

I am running Slurm on CentOS 6. One thing I've just noticed is that
slurmctld is running under the user slurm but slurmd is running under
root. Not quite sure why these daemons run under different owners; any
insight, please? I further looked at both Slurm daemon configurations in
/etc/init.d and slurm.conf; they are identical, so how could they behave
differently? Anyway, can the ownership of slurmd be changed to the user
slurm?

The problem I've got now is that I am running nagios monitoring via ssh; it
can check all other application daemon statuses, but it always fails to check
the slurm daemon status due to the slurmd root access restriction.

Thanks.