> -----Original Message-----
> From: [email protected] [mailto:owner-slurm-
> [email protected]] On Behalf Of Arnau Bria
> Sent: Thursday, November 03, 2011 6:52 AM
> To: [email protected]
> Subject: [slurm-dev] new to slurm
>
> Hi all,
>
> My name is Arnau Bria and I work as a sysadmin at PIC (a data center in
> Barcelona). We have a cluster of ~300 nodes and 3300 job slots under
> torque/maui. Our current scenario, more than 6k jobs, causes serious
> problems for torque/maui, so we're studying alternatives, and it seems
> that slurm has the same torque/maui features (and more) and scales much
> better.
>
> So, I'm starting to read some slurm docs and I've been able to install a
> server, configure some partitions and a couple of nodes, and submit some
> jobs. I've learned some basic commands to manage
> partitions/nodes/queues/jobs.
>
> Now I'd like to start a deeper investigation, so I'm trying to
> "import" torque's configuration into slurm and see what still makes
> sense and what does not:
>
> 1.-) From:
> https://computing.llnl.gov/linux/slurm/faq.html#fast_schedule
> "How can I configure SLURM to use the resources actually found
> on a node rather than what is defined in slurm.conf?"
>
> All my nodes (which have 4 CPUs) show only 1 CPU. I can't make slurm
> detect node resources automatically. This is my conf:
> [...]
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> FastSchedule=0
> [...]
> NodeName=DEFAULT State=UNKNOWN
> NodeName=tditaller002.pic.es,tditaller005.pic.es
>
> node log:
> [...]
> Nov 3 14:40:20 tditaller002 slurmd[8245]: slurmd version 2.3.1 started
> Nov 3 14:40:20 tditaller002 slurmd[8245]: slurmd started on Thu 03 Nov
> 2011 14:40:20 +0100
> Nov 3 14:40:20 tditaller002 slurmd[8245]: Procs=1 Sockets=1 Cores=1
> Threads=1 Memory=7985 TmpDisk=1990 Uptime=98838
FastSchedule=0 is the appropriate setting, but the slurmd is apparently
only seeing one processor. Try running "scontrol show slurmd" on the
compute node to confirm, and compare that against /proc/cpuinfo. If the
cause is still not obvious, try turning up the SlurmdDebug level and
looking for clues in a more verbose log file.

> 2.-) CPU_factor
> In torque we define a cpu_factor, a way to normalize cpu_time between
> two different hosts (host A is fast, host B is slow, so 1 second on
> host A equals 2 on host B).
>
> Is this configurable in slurm? What name do you use for it?

The closest analog to this is the "Weight" setting in the slurm.conf
file.

> 3.-) max node load.
> May I configure a maximum amount of load on a node? I.e., a node with
> 4 CPUs will run 4 jobs, but if it reaches some load while running 3,
> I'd like slurm NOT to send more jobs to that node.

Here you should look at the "Shared" partition configuration setting in
the slurm.conf man page. Shared=YES:3 could be appropriate for this
scenario.

> 4.-) How is the file copy between client/server done (input/output)?
> ssh? NFS? Is it configurable?

The easiest mechanism for sharing files across login and compute nodes
is a shared file system. However, if you need to push files around, look
at the sbcast command.

> Well, I think I've asked enough questions for my first mail :-)
> Could anyone answer some (or all) of these questions? Could anyone
> send me a link to presentations/wiki/extended_doc?

http://schedmd.com/slurmdocs/publications.html has some publications
that are not bundled with the html pages included in the SLURM
distribution.

Don

> Many thanks in advance,
> Cheers,
> Arnau
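P.S. Regarding questions 1 through 3: if slurmd keeps reporting a single
CPU, you can also define the node resources explicitly in slurm.conf and
set FastSchedule=1, and the Weight and Shared settings live in the same
file. A minimal sketch, assuming the hardware from your log (the
hostnames, memory size, weights, and partition name below are only
illustrative, not a recommended configuration):

```
# Describe the hardware explicitly rather than relying on autodetection.
FastSchedule=1
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

# Weight plays a role similar to torque's cpu_factor: SLURM allocates
# lower-weight nodes first, so give slower hosts a larger Weight.
NodeName=tditaller002.pic.es Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7985 Weight=1 State=UNKNOWN
NodeName=tditaller005.pic.es Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7985 Weight=2 State=UNKNOWN

# Shared=YES:3 lets at most 3 jobs share a node's resources; note this
# is a job-count limit, not a load-average threshold.
PartitionName=taller Nodes=tditaller002.pic.es,tditaller005.pic.es Shared=YES:3 Default=YES State=UP
```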
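P.S. Regarding question 4: sbcast runs inside a job allocation and
copies a file to local storage on every node of the job, which avoids
hammering a shared file system at job start. A sketch of a batch script
(the paths and program name are only illustrative):

```sh
#!/bin/sh
#SBATCH -N 4
# Copy the input file from the shared file system to local disk on
# every node of the allocation, then run against the local copies.
sbcast /shared/home/arnau/input.dat /tmp/input.dat
srun my_program /tmp/input.dat
```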
