[slurm-dev] slurmdb error

2017-04-26 Thread Mahmood Naderan
Hi, does anyone have an idea about this error? [root@cluster ~]# sacctmgr -i create cluster Rocks-Cluster sacctmgr: error: slurmdbd: Sending DbdInit msg: Unable to connect to database sacctmgr: error: Problem talking to the database: Unable to connect to database -- Regards, Mahmood

[slurm-dev] Querying historic CPU utilisation, equivalent to `sinfo`?

2017-04-26 Thread David William James Perry
All, I'm collecting some usage metrics for our cluster, and I'd like to look at utilisation in terms of allocated CPU % by partition, basically equivalent of `sinfo -O cpusstate -p partition_name`, but for historic data. What's the best way to do this? I've found that running `sacct

[slurm-dev] Re: slurmdb error

2017-04-26 Thread Mahmood Naderan
>you need to have slurmdbd running. Is it running? [root@cluster ~]# ps aux | grep slurmdb root 3406 0.0 0.0 338636 2672 ? Sl 00:26 0:01 /usr/sbin/slurmdbd root 17146 0.0 0.0 105308 888 pts/2 S+ 13:26 0:00 grep slurmdb >Is slurm.conf pointing to the right

[slurm-dev] Re: slurm-dev Asymmetric resources and --multi-prog

2017-04-26 Thread Hendryk Bockelmann
Hello, this feature is implemented in "Job Packs" and was presented at SLUG 2016: https://slurm.schedmd.com/SLUG16/Job_Packs_SUG_2016.pdf Nevertheless, it does not seem to be in slurm-17 ... Regards, Hendryk On 26.04.2017 14:12, Malte Thoma wrote: Hi Kolja, AFAIK it is not possible with slurm

[slurm-dev] slurm-dev Asymmetric resources and --multi-prog

2017-04-26 Thread bjoern.mielsch
Dear all, I'm new to Slurm and clusters in general and I'm currently charged with a task which, I would imagine, is rather simple but I can't figure it out. I need to start several processes using --multi-prog but assign different amounts of resources to each process. (i.e. asymmetric

[slurm-dev] Re: slurm-dev Asymmetric resources and --multi-prog

2017-04-26 Thread Malte Thoma
Hi Kolja, AFAIK it is not possible with slurm YET :'-( If you find out anything, we would be VERY interested in an example ;-). Regards, Malte On 26.04.2017 at 13:35, bjoern.miel...@iwes.fraunhofer.de wrote: Dear all, I'm new to Slurm and clusters in general and I'm currently

[slurm-dev] Slurm job priorities

2017-04-26 Thread Baker D . J .
Hello, I guess this may be a simple question for someone more experienced with slurm scheduling than us. When jobs are queuing in our cluster we find that we get a lot of these messages in our slurmctld.log: error: Job 25766 priority exceeds 32 bits I cannot find any mention or discussion of this

[slurm-dev] Re: Slurm job priorities

2017-04-26 Thread Loris Bennett
Hi David, Baker D.J. writes: > Hello, > > I guess this may be a simple question for someone more experienced with slurm > scheduling than us. When jobs are queuing in our cluster we find that we get > a lot of these messages in our slurmctld.log > > error: Job 25766

[slurm-dev] Re: slurmdb error

2017-04-26 Thread Jeff Tan
Hi Mahmood > [root@cluster ~]# ps aux | grep slurmdb > root 3406 0.0 0.0 338636 2672 ? Sl 00:26 0:01 > /usr/sbin/slurmdbd > root 17146 0.0 0.0 105308 888 pts/2 S+ 13:26 0:00 grep slurmdb That's good. What does its /var/log/slurm/slurmdbd.log say? Any errors? >

[slurm-dev] RE: Slurm job priorities

2017-04-26 Thread Baker D . J .
Hi Loris, Thank you for your reply. The output from "sprio -l" is: JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE TRES 25988 mjp1m12 -922337203 2nan 1 1000 0 0

[slurm-dev] Re: slurmdb error

2017-04-26 Thread Wensheng Deng
Also check to see if munge is functioning properly. On Wed, Apr 26, 2017 at 10:00 AM, Jeff Tan wrote: > Hi Mahmood > > > [root@cluster ~]# ps aux | grep slurmdb > > root 3406 0.0 0.0 338636 2672 ? Sl 00:26 0:01 > > /usr/sbin/slurmdbd > > root 17146

[slurm-dev] RE: Slurm job priorities

2017-04-26 Thread Loris Bennett
Hi David, Baker D.J. writes: > Hi Loris, > > Thank you for your reply. The output from "sprio -l" is: > > JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE > PARTITION QOS NICE TRES > 25988 mjp1m12 -922337203

[slurm-dev] Jobs immediately fail

2017-04-26 Thread Vlad Firoiu
I've noticed that when I run jobs on lowest priority, some fraction (almost always the second half of a job array) fail immediately. In sacct this is what I see: 8365358_98 agent_0_v+ om_all_no+ 2 TIMEOUT 1:0 8365358_98.+ batch 2

[slurm-dev] Re: jobs killed after 24h though walltime is 7 days

2017-04-26 Thread Andy Riebs
Is there a time limit set on the queue (rather than the user)? On 04/26/2017 12:57 PM, Uwe Sauter wrote: Hi all, I have a mysterious situation where a user's job is killed after 24h though he specified "-t 7-00:00:00" on submission. This happened to several jobs of this user in the last few

[slurm-dev] Re: jobs killed after 24h though walltime is 7 days

2017-04-26 Thread Uwe Sauter
MaxTime on the partition / queue is set to 10 weeks (70-00:00:00). What I forgot to mention is that some months ago, the same account had jobs running successfully for several days. But since then there have been at least two updates within 16.05. On 26.04.2017 at 19:06, Andy Riebs wrote:

[slurm-dev] jobs killed after 24h though walltime is 7 days

2017-04-26 Thread Uwe Sauter
Hi all, I have a mysterious situation where a user's job is killed after 24h though he specified "-t 7-00:00:00" on submission. This happened to several jobs of this user in the last few days. The account he's using has MaxWall set to 7-00:00:00. There is no QoS used. In
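
When comparing limits such as "-t 7-00:00:00", a MaxWall of 7-00:00:00 and an observed 24h kill, it can help to convert them all to seconds. The following is only a minimal sketch, assuming the time formats documented for sbatch ("minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"). The function name `slurm_time_to_seconds` is hypothetical and not part of Slurm.

```python
def slurm_time_to_seconds(spec: str) -> int:
    """Convert a Slurm-style time limit string to seconds.

    Hypothetical helper (not a Slurm API). Assumes the formats listed in
    the sbatch man page for --time; "UNLIMITED"/"INFINITE" are not handled.
    """
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, clock = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in clock.split(":")]
        if not 1 <= len(fields) <= 3:
            raise ValueError(f"unsupported time format: {spec!r}")
        hours, minutes, seconds = fields + [0] * (3 - len(fields))
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:        # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unsupported time format: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    requested = slurm_time_to_seconds("7-00:00:00")   # job's -t value
    observed = slurm_time_to_seconds("24:00:00")      # when the job was killed
    print(requested, observed, requested // observed)
```

A job that ends at exactly 86400 seconds while its request converts to 604800 suggests that some other limit, rather than the requested one, is being applied.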

[slurm-dev] Re: slurmdb error

2017-04-26 Thread Jeff Tan
Hi Mahmood > [root@cluster ~]# sacctmgr -i create cluster Rocks-Cluster > sacctmgr: error: slurmdbd: Sending DbdInit msg: Unable to connect to database > sacctmgr: error: Problem talking to the database: Unable to connect > to database You need to narrow that down. If you're using sacctmgr, you

[slurm-dev] Slurm node name problem

2017-04-26 Thread Mark Lescroart
Hello, Our slurm control process is having some difficulty with node names. Our computing cluster is set up as a bunch of virtual machines. The names of each node are mda01, mda02, ... (mda[01-64] in slurmspeak). If we ssh to the nodes, `hostname` returns the correct name (same with
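
For readers unfamiliar with the bracket notation, here is a minimal sketch of how an expression like mda[01-64] expands into individual node names. It handles only a single zero-padded numeric range, which is far less than Slurm's real hostlist syntax; `expand_hostlist` is a hypothetical name, not a Slurm function.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a simple 'prefix[lo-hi]suffix' expression into host names.

    Hypothetical helper covering only one numeric range; comma-separated
    lists and multiple bracket groups are deliberately not supported.
    """
    match = re.fullmatch(r"([^\[\]]*)\[(\d+)-(\d+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket range: treat the expression as a single host name.
        return [expr]
    prefix, low, high, suffix = match.groups()
    width = len(low)  # keep the zero padding of the lower bound
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(low), int(high) + 1)
    ]


if __name__ == "__main__":
    nodes = expand_hostlist("mda[01-64]")
    print(len(nodes), nodes[0], nodes[-1])
```

On a machine with Slurm installed, `scontrol show hostnames mda[01-64]` performs the same expansion with the full syntax, so the sketch is only useful for checking what names the controller should expect.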

[slurm-dev] Re: Memory bounds for jobs

2017-04-26 Thread Alexander Vorobiev
Thanks for the reply! A concrete example is interactive jobs (R, python, etc). I want users to request only the minimum amount of memory they need, but I don't want their session to generate an error if they try to allocate more memory while free memory is available in the system. The upper

[slurm-dev] Potential CUDA Memory Allocation Issues

2017-04-26 Thread Samuel Bowerman
Hello Slurm community, Our lab has recently begun transitioning from maui/torque to slurm, but we are having some difficulties getting our configuration correct. In short, our CUDA tests routinely violate the virtual memory limits assigned by the scheduler even though the physical memory space