> -----Original Message-----
> From: [email protected] [mailto:owner-slurm-
> [email protected]] On Behalf Of Arnau Bria
> Sent: Thursday, November 03, 2011 6:52 AM
> To: [email protected]
> Subject: [slurm-dev] new to slurm
>
> Hi all,
>
> My name is Arnau Bria and I work as a sysadmin at PIC (a data center in
> Barcelona). We have a cluster of ~300 nodes and 3300 job slots under
> torque/maui. Our current scenario, more than 6k jobs, causes serious
> problems for torque/maui, so we're studying alternatives, and it seems
> that slurm has the same torque/maui features (and more) and scales much
> better.
>
> So, I'm starting to read some slurm docs and I've been able to install a
> server, configure some partitions and a couple of nodes, and submit some
> jobs. I've learned some basic commands to manage
> partitions/nodes/queues/jobs.
>
> Now I'd like to start a deeper investigation, so I'm trying to
> "import" torque's configuration into slurm and see what still makes
> sense and what does not:
>
> 1.-) From:
> https://computing.llnl.gov/linux/slurm/faq.html#fast_schedule
> "How can I configure SLURM to use the resources actually found
> on a node rather than what is defined in slurm.conf?"
>
> All my nodes (which have 4 CPUs) show only 1 CPU. I can't make slurm
> detect node resources automatically. This is my conf:
> [...]
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> FastSchedule=0
> [...]
> NodeName=DEFAULT State=UNKNOWN
> NodeName=tditaller002.pic.es,tditaller005.pic.es
>
> node log:
> [...]
> Nov 3 14:40:20 tditaller002 slurmd[8245]: slurmd version 2.3.1 started
> Nov 3 14:40:20 tditaller002 slurmd[8245]: slurmd started on Thu 03 Nov
> 2011 14:40:20 +0100
> Nov 3 14:40:20 tditaller002 slurmd[8245]: Procs=1 Sockets=1 Cores=1
> Threads=1 Memory=7985 TmpDisk=1990 Uptime=98838
FastSchedule=0 is the appropriate setting, but the slurmd is apparently
only seeing one processor. Try running "scontrol show slurmd" on the
compute node to confirm, and compare that against /proc/cpuinfo. If the
cause is still not obvious, try turning up the SlurmdDebug level and
looking for clues in a more verbose log file.

> 2.-) CPU_factor
> In torque we define a cpu_factor, a way to normalize cpu_time between
> two different hosts (host A is fast, host B is slow, so 1 second on
> host A equals 2 on host B).
>
> Is this configurable in slurm? What name do you use for it?

The closest analog to this is the "Weight" setting in the slurm.conf
file.

> 3.-) max node load.
> May I configure a maximum amount of load on a node? I.e., a node with
> 4 CPUs will run 4 jobs, but if it reaches some load while running 3,
> I'd like slurm NOT to send more jobs to that node.

Here you should look at the "Shared" partition configuration setting in
the slurm.conf man page. Shared=YES:3 could be appropriate for this
scenario.

> 4.-) How is the file copy between client/server done (input/output)?
> ssh? NFS? Is it configurable?

The easiest mechanism for sharing files across login and compute nodes
is a shared file system. However, if you need to push files around, look
at the sbcast command.

> Well, I think I've asked enough questions for my first mail :-)
> Could anyone answer some (or all) of these questions? Could anyone
> send me a link to presentations/wiki/extended_doc?

http://schedmd.com/slurmdocs/publications.html has some publications
that are not bundled with the html pages included in the SLURM
distribution.

Don

> Many thanks in advance,
> Cheers,
> Arnau
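P.S. Regarding questions 1 through 3: if slurmd keeps reporting a single
CPU, you can also define the node resources explicitly in slurm.conf and
set FastSchedule=1, and the Weight and Shared settings live in the same
file. A minimal sketch, assuming the hardware from your log (the
hostnames, memory size, weights, and partition name below are only
illustrative, not a recommended configuration):

```
# Describe the hardware explicitly rather than relying on autodetection.
FastSchedule=1
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

# Weight plays a role similar to torque's cpu_factor: SLURM allocates
# lower-weight nodes first, so give slower hosts a larger Weight.
NodeName=tditaller002.pic.es Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7985 Weight=1 State=UNKNOWN
NodeName=tditaller005.pic.es Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7985 Weight=2 State=UNKNOWN

# Shared=YES:3 lets at most 3 jobs share a node's resources; note this
# is a job-count limit, not a load-average threshold.
PartitionName=taller Nodes=tditaller002.pic.es,tditaller005.pic.es Shared=YES:3 Default=YES State=UP
```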
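P.S. Regarding question 4: sbcast runs inside a job allocation and
copies a file to local storage on every node of the job, which avoids
hammering a shared file system at job start. A sketch of a batch script
(the paths and program name are only illustrative):

```sh
#!/bin/sh
#SBATCH -N 4
# Copy the input file from the shared file system to local disk on
# every node of the allocation, then run against the local copies.
sbcast /shared/home/arnau/input.dat /tmp/input.dat
srun my_program /tmp/input.dat
```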
