Hi folks,
We're seeing what looks like incorrect step accounting in the MySQL
database when we're using preemption and PreemptMode=REQUEUE. When a job
is requeued, a new job_table row is created with a new job_db_inx. Next,
a step_table record is created and uses the new job_db_inx, instead of
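For anyone who wants to see the effect in their own accounting database, a
query along these lines shows how many step rows hang off each job row; a
requeued job shows up as multiple job_db_inx values for the same id_job.
(The table and column names here follow the 2.6-era schema and may be
cluster-prefixed on your install.)

    SELECT j.job_db_inx, j.id_job, COUNT(s.job_db_inx) AS step_rows
    FROM job_table j
    LEFT JOIN step_table s ON s.job_db_inx = j.job_db_inx
    GROUP BY j.job_db_inx, j.id_job
    ORDER BY j.id_job, j.job_db_inx;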
On 08/18/2013 11:51 PM, Christopher Samuel wrote:
On 19/08/13 12:02, Christopher Samuel wrote:
Chasing through the code, it looks like getgrnam_r() fails in
get_group_members() in src/slurmctld/groups.c on our Slurm 2.6 boxes
(on RHEL 6.4).
Scratch that, restarting slurmctld doesn't provoke the
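For background, getgrnam_r() legitimately returns ERANGE when the supplied
buffer is too small to hold a large group, so the caller is expected to grow
the buffer and retry. A generic sketch of that libc idiom (this is not the
actual get_group_members() code):

    #include <errno.h>
    #include <grp.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Look up a group by name, doubling the buffer on ERANGE. */
    static struct group *lookup_group(const char *name, struct group *grp,
                                      char **buf, size_t *buflen)
    {
        struct group *result;
        int rc;

        while ((rc = getgrnam_r(name, grp, *buf, *buflen, &result)) == ERANGE) {
            char *nbuf = realloc(*buf, *buflen * 2);
            if (nbuf == NULL)
                return NULL;        /* out of memory */
            *buf = nbuf;
            *buflen *= 2;
        }
        if (rc != 0 || result == NULL)
            return NULL;            /* lookup error, or no such group */
        return result;
    }

    int main(void)
    {
        size_t buflen = 1024;
        char *buf = malloc(buflen);
        struct group grp;

        if (buf != NULL && lookup_group("wheel", &grp, &buf, &buflen) != NULL)
            printf("gid of 'wheel' is %u\n", (unsigned) grp.gr_gid);
        free(buf);
        return 0;
    }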
On 08/05/2013 06:52 PM, Kevin Abbey wrote:
I started using cgroups to control memory usage last week. One user
reported his application takes 4 times longer to complete. I read
elsewhere that cgroup memory control can reduce performance. Is this
slowdown realistic?
Is there a more efficient method?
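For reference, the memory-related knobs live in cgroup.conf; a minimal
example of the settings usually involved (values are illustrative only,
not a recommendation):

    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedRAMSpace=100
    AllowedSwapSpace=10

A slowdown that large is more often the job exceeding its cap and being
pushed into swap than overhead from the cgroup accounting itself, so that
is worth checking first.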
p) and 6-16] (which is actually no node name at all but a wrong
parsing after the failure)

Thanks
Eva

On Wed, 10 Jul 2013, John Thiltges wrote:

> On 07/10/2013 06:16 PM, Eva Hocks wrote:
>> The ent
On 07/10/2013 06:16 PM, Eva Hocks wrote:
> The entry in partition.conf:
> PartitionName=CLUSTER Default=yes State=UP
> nodes=gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]
>
> causes slurmctld to crash:
>
> [2013-07-10T16:03:22.923] error: find_node_record: lookup failure for
> gpu-[2]-[4]
> [2013-0
On 06/19/2013 10:36 AM, Paul Edmon wrote:
> I have a group here that wants to submit a ton of jobs to the queue, but
> wants to restrict how many they have running at any given time so that
> they don't torch their fileserver.
The licenses feature might work OK for this. Create a license for the
group with a count equal to the desired job limit, and have each of
their jobs request one.
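In slurm.conf that could look something like this (the license name
"iolimit" and the count of 10 are made up for the example):

    Licenses=iolimit:10

Each of their jobs then requests one:

    sbatch -L iolimit:1 job.sh

Once 10 licenses are in use, further jobs requesting one sit pending until
a running job finishes and frees a license.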
On 06/18/2013 05:53 PM, Eva Hocks wrote:
> any sacct command returns:
>
> slurmctld.log:[2013-06-18T14:45:57-07:00] error: Association database
> appears down, reading from state file.
In slurm.conf, are you using
"AccountingStorageType=accounting_storage/slurmdbd"?
If so, check that slurmdbd is running and that slurmctld can reach it.
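Two quick checks from the head node (assuming sacctmgr and scontrol are in
your PATH):

    # does slurmdbd answer at all?
    sacctmgr show cluster

    # what is slurmctld actually configured to use?
    scontrol show config | grep -i AccountingStorage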
I think we've seen this problem as well. When using ThreadsPerCore along
with DefMemPerCpu or --mem-per-cpu, the worker node doesn't set the
memory limit correctly. Setting --mem instead should work OK.
I have a lightly tested patch that appears to address the issue, but
I've only tried it on
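In the meantime, requesting memory per node instead of per CPU sidesteps
the bad per-CPU calculation. For example, for four tasks that need 2000 MB
each (sizes are only illustrative, and this assumes the tasks share one
node):

    # instead of:
    sbatch -n 4 --mem-per-cpu=2000 job.sh
    # request the node total:
    sbatch -n 4 --mem=8000 job.sh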
On 02/11/2013 12:20 AM, Christopher Samuel wrote:
> One of the nice things in Torque that a number of our users use is
> interactive jobs with X11 forwarding so they can run various codes with
> graphical interfaces on the compute nodes (such as VMD, or MATLAB, etc).
Hi Chris,
To solve this when