[slurm-dev] Re: SLURM user authentication error(s)

2012-08-09 Thread Moe Jette
It does reload slurm.conf, but some things can't be changed without restarting daemons. As I recall, the slurm.conf man page identifies what can't be changed without restarting daemons. Quoting gugga 4u gugg...@gmail.com: Ok, I'm all set ... apparently scontrol reconfig does NOT reload
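
For reference, the two operations look like this (the init-script path varies by distribution and is only an example):

  $ scontrol reconfigure        # re-reads slurm.conf for changeable parameters
  $ /etc/init.d/slurm restart   # required for parameters the man page flags as needing a restart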

[slurm-dev] Re: node load reporting

2012-08-08 Thread Moe Jette
. However, slurmd only wakes up and processes when it receives a message from slurmctld. AFAICT, it'd only reevaluate cpu load (i.e. run through the get* functions) when sent a reconfigure message. Or have I got that wrong? Thanks M On Tue, Aug 7, 2012 at 10:01 AM, Moe Jette je...@schedmd.com

[slurm-dev] Re: Prolog epilog execution

2012-08-08 Thread Moe Jette
See the section labeled Prolog and Epilog Scripts in the slurm.conf man page http://www.schedmd.com/slurmdocs/slurm.conf.html especially the second note. Quoting gugga 4u gugg...@gmail.com: Hi, I am evaluating slurm for a project at my work. I cannot find enough documentation on several
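
The relevant slurm.conf entries are just paths to executables; the script names below are placeholders:

  Prolog=/etc/slurm/prolog.sh   # run by slurmd on each allocated node before the job
  Epilog=/etc/slurm/epilog.sh   # run by slurmd on each allocated node after the job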

[slurm-dev] Re: node load reporting

2012-08-07 Thread Moe Jette
SLURM is designed to allocate and use resources (e.g. CPUs and memory) rather than monitor CPU load and use that as a basis for scheduling. Although that feature has been requested in the past, there are no immediate plans to work on it. Right now the load average is not collected, although

[slurm-dev] SLURM version 2.4.2 is now available

2012-08-01 Thread Moe Jette
SLURM version 2.4.2 is now available from http://www.schedmd.com/#repos It includes many bug fixes, most of which are IBM BlueGene specific. * Changes in SLURM 2.4.2 -- BLUEGENE - Correct potential deadlock issue when hardware goes bad and there are jobs running on

[slurm-dev] Re: backfill scheduler now sets a job's start time even if nodes are unavailable

2012-08-01 Thread Moe Jette
Not setting the time here does not mean there is no start time, only that any time set in the previous execution of the backfill scheduler is unchanged, which is probably no longer correct. Note there is a variable later_start that attempts to start jobs at later times based upon when

[slurm-dev] Re: Setting up a test cluster

2012-07-12 Thread Moe Jette
Quoting hgiesel...@us.nanya.com: I have SLURM 2.3.3 installed and running in production but have found that I need to do some more testing and tweaking before I can migrate all of our LSF jobs to SLURM. I would like to install a test cluster but am unsure about the following: 1.

[slurm-dev] Re: --cpu_bind=none still binds jobs to CPUs

2012-07-11 Thread Moe Jette
Quoting Phil Sharfstein psharfst...@vrinc.com: With TaskPlugin=task/affinity (using hwloc), when I launch a job with --cpu_bind=none, the job still gets launched explicitly bound to CPUs. Looking at src/plugins/task/affinity/dist_tasks.c, it seems like this behavior is by design in

[slurm-dev] Re: Creating reservations based on a weight variable

2012-07-10 Thread Moe Jette
! I'll be more than happy to contribute and add it. Can you point me to the existing logic and to an explanation on how to start with code modifications. Best regards Yuval. From: Moe Jette je...@schedmd.com To: slurm-dev slurm-dev@schedmd.com Sent: Monday

[slurm-dev] RE: openmpi, cgroups slurm (unexpected behaviour)

2012-07-10 Thread Moe Jette
You do not identify what version of OpenMPI and different versions operate very differently (See http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi). You might run scontrol show step and scontrol show -d job to see what CPUs are allocated by SLURM to the job and the job step. The
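
For example (job and step ids are illustrative):

  $ scontrol show -d job 1234   # -d adds the CPU_IDs allocated on each node
  $ scontrol show step 1234.0   # resources assigned to the job step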

[slurm-dev] Re: Slurm loosing resources

2012-07-04 Thread Moe Jette
CPUAlloc comes from the alloc_cpus field on the node data structure and that represents the number of bits set in the row_bitmap field. The relevant code is in the select plugin (probably select/cons_res). Adding logging after jobs are allocated resources (look for add_job_to_cores) and

[slurm-dev] Re: Problems upgrading to 2.4.0

2012-07-04 Thread Moe Jette
moving from previous releases to 2.4.1 do not have any plugins configured. Nancy From: Moe Jette je...@schedmd.com To: slurm-dev slurm-dev@schedmd.com, Date: 07/04/2012 10:48 AM Subject:[slurm-dev] Re: Problems upgrading to 2.4.0 Did the srun start on v2.3, but not get

[slurm-dev] Re: Gang Scheduling

2012-06-29 Thread Moe Jette
= topology/none TrackWCKey = 0 TreeWidth = 50 UsePam = 1 UnkillableStepProgram = (null) UnkillableStepTimeout = 60 sec VSizeFactor = 0 percent WaitTime= 0 sec On 06/28/2012 05:53 PM, Moe Jette wrote: Do you have Shared

[slurm-dev] FW: slurm and mpich2

2012-06-29 Thread Moe Jette
SLURM's PMI library gets the task rank from the environment variable SLURM_PROCID. I believe that you are not using that library. You can at least confirm that the environment variable is set correctly by running something like this: $ srun -n4 -l printenv SLURM_PROCID | sort 0: 0 1: 1 2: 2

[slurm-dev] Re: Gang Scheduling

2012-06-28 Thread Moe Jette
Do you have the Shared option configured for the partitions as instructed here: http://www.schedmd.com/slurmdocs/gang_scheduling.html Quoting Sefa Arslan sefa.ars...@tubitak.gov.tr: I have the following configuration for Gang Scheduling: SchedulerTimeSlice = 30 sec SchedulerType =
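
A minimal gang-scheduling setup per that page looks roughly like this (node and partition names are placeholders):

  PreemptMode=GANG
  SchedulerTimeSlice=30
  PartitionName=batch Nodes=node[01-10] Shared=FORCE:2 Default=YES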

[slurm-dev] Re: LSF to SLURM limits

2012-06-20 Thread Moe Jette
The SLURM limits available are described here: http://www.schedmd.com/slurmdocs/resource_limits.html Quoting hgiesel...@us.nanya.com: Hello all, I am in the process of converting our small (13 nodes) LSF 7 cluster to SLURM 2.3.3 and am running into some challenges. We do not use

[slurm-dev] Re: Performance about slurmctld?

2012-06-07 Thread Moe Jette
SLURM version 2.5 will be much faster, but is under development and may be unstable. You can get it from github (in the master branch): https://github.com/SchedMD/slurm Quoting Chi Shin Hsu for.shin1...@gmail.com: Hi, I have a project using slurm. Now there are 10 nodes in the cluster,

[slurm-dev] Re: Enforcing nodes to be multiples of CPUs

2012-06-06 Thread Moe Jette
You can do that with a SLURM job submit plugin. I would recommend doing it with a script. More information here: http://www.schedmd.com/slurmdocs/job_submit_plugins.html Quoting Daniel Adriano Silva M dadri...@gmail.com: Hi Developers, In order to reduce cluster fragmentation, when using
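
A minimal job_submit/lua sketch, assuming the multiple-of-CPUs policy from the question (field names follow the job_submit.lua example shipped with SLURM; adjust for your version):

  # in slurm.conf
  JobSubmitPlugins=lua

  -- /etc/slurm/job_submit.lua
  function slurm_job_submit(job_desc, part_list, submit_uid)
      -- hypothetical policy: reject CPU counts that are not multiples of 8
      if job_desc.min_cpus % 8 ~= 0 then
          return slurm.ERROR
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end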

[slurm-dev] Re: Enforcing nodes to be multiples of CPUs

2012-06-06 Thread Moe Jette
User-made plugins? Daniel 2012/6/6 Moe Jette je...@schedmd.com You can do that with a SLURM job submit plugin. I would recommend doing it with a script. More information here: http://www.schedmd.com/slurmdocs/job_submit_plugins.html Quoting Daniel Adriano Silva M dadri...@gmail.com

[slurm-dev] Re: Job launch process

2012-06-06 Thread Moe Jette
Most of the RPC are logged using debug2() messages, so increasing the SlurmctldDebug value by one should make this much more clear and you should see something like this: Job allocation: slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1001 slurmctld: debug2:
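
The debug level can also be raised on a running slurmctld without a restart (the log path is configuration-dependent):

  $ scontrol setdebug debug2
  $ grep 'Processing RPC' /var/log/slurm/slurmctld.log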

[slurm-dev] Re: Problem with quotes in sched/wiki2 plugin

2012-06-06 Thread Moe Jette
My recollection is this change was made to address someone submitting a job in which the working directory contained a #. When Moab read job state information from SLURM, it interpreted the # as a job separator and could not parse anything after that point. Quoting the working directory

[slurm-dev] Re: SLURM for nighly use

2012-06-04 Thread Moe Jette
SLURM preemption is mostly designed to stop lower priority jobs when higher priority jobs are available. There is a web page about job preemption here: http://www.schedmd.com/slurmdocs/preempt.html If you want to stop and start SLURM jobs at specific times, your best option may be to use a

[slurm-dev] FW: SLURM Question

2012-06-04 Thread Moe Jette
No problem running batch and interactive jobs on the same nodes. Judging from the error message: srun: error: Task launch for 442.0 failed on node svthwVCS: Invalid job credential The job credential includes the node name(s) that it is valid for. I would guess there is some problem with

[slurm-dev] Re: Splitting a large SMP machine into nodes

2012-06-03 Thread Moe Jette
See http://www.schedmd.com/slurmdocs/faq.html#multi_slurmd Configure each slurmd as having the resources of the NUMA socket and bind to that socket using cpusets or Linux cgroups Quoting Artem Kulachenko artem.kulache...@gmail.com: Hi Everyone, We have one middle size SMP machine with NUMA
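
The FAQ entry boils down to building with --enable-multiple-slurmd and giving each pseudo-node its own Port; the names and resource counts below are placeholders:

  NodeName=smp1 NodeHostname=bigbox Port=17001 CPUs=8 RealMemory=32768
  NodeName=smp2 NodeHostname=bigbox Port=17002 CPUs=8 RealMemory=32768

  $ slurmd -N smp1              # one slurmd per pseudo-node
  $ slurmd -N smp2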

[slurm-dev] Re: Non-exclusive use for cpu resource

2012-06-02 Thread Moe Jette
Quoting Chi Shin Hsu for.shin1...@gmail.com: Hi, The slurm default resource policy is exclusive mode for all resources. Can I use another mode to control my resources? For example, one node can execute 1000 jobs, because my tasks do not use much cpu but take a long time. I have tried to

[slurm-dev] Re: I/O error writing script/environment to file

2012-05-31 Thread Moe Jette
Quoting Clay Teeter teet...@gmail.com: Sure, I'll volunteer. Comments inline On Wed, May 30, 2012 at 3:03 PM, Moe Jette je...@schedmd.com wrote: If you are volunteering, then sure. * Basing the subdirectory off the last digit or two of the job id should be easiest For clarity
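
A minimal C sketch of the last-digit bucketing idea (identifiers here are illustrative, not SLURM's actual ones):

  #include <stdio.h>

  /* Place each job's state directory in one of ten hash buckets so no
   * single directory hits the filesystem's subdirectory-link limit. */
  static void job_state_path(char *buf, size_t len,
                             const char *state_dir, unsigned job_id)
  {
      snprintf(buf, len, "%s/hash.%u/job.%u", state_dir, job_id % 10, job_id);
  }

  int main(void)
  {
      char path[256];
      job_state_path(path, sizeof(path), "/var/spool/slurm/state", 123456);
      printf("%s\n", path);     /* .../hash.6/job.123456 */
      return 0;
  }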

[slurm-dev] Re: slurmd will not start with GresType=gpu

2012-05-31 Thread Moe Jette
I believe this is fixed in v2.3.1. This is from NEWS: -- Do not treat the absence of a gres.conf file as a fatal error on systems configured with GRES, but set GRES counts to zero. Quoting martin.pe...@bull.com: A user has reported the following problem on 2.2.7. If gpu resources are
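
For reference, a gres.conf stanza pairing with the slurm.conf GRES definition looks like this (node names and device paths are examples):

  # slurm.conf
  GresTypes=gpu
  NodeName=tux[0-7] Gres=gpu:2 ...

  # gres.conf
  Name=gpu File=/dev/nvidia0
  Name=gpu File=/dev/nvidia1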

[slurm-dev] Re: I/O error writing script/environment to file

2012-05-30 Thread Moe Jette
See: http://superuser.com/questions/298420/cannot-mkdir-too-many-links With Ubuntu 12.04 (Linux 3.2.0-24) the limit is at least 200k rather than 32k. Quoting Clay Teeter teet...@gmail.com: Hi Group, Anyone know how I might troubleshoot this error message? [2012-05-15T19:34:27]

[slurm-dev] Re: startup script patch to allow $BINDIR in /etc/sysconfig/slurm

2012-05-30 Thread Moe Jette
This will be included with SLURM v2.4.0. We will probably not release any more version 2.3.x. Thanks! Quoting Andy Wettstein wettst...@uchicago.edu: Hi, This patch moves the check for scontrol after sourcing /etc/sysconfig/slurm. This allows the $BINDIR to be set in the config file.

[slurm-dev] Re: I/O error writing script/environment to file

2012-05-30 Thread Moe Jette
at 10:56 AM, Moe Jette je...@schedmd.com wrote: See: http://superuser.com/questions/298420/cannot-mkdir-too-many-links With Ubuntu 12.04 (Linux 3.2.0-24) the limit is at least 200k rather than 32k. Quoting Clay Teeter teet...@gmail.com: Hi Group, Anyone know how I might troubleshoot

[slurm-dev] Re: I/O error writing script/environment to file

2012-05-30 Thread Moe Jette
be interested in a patch for this? Cheers Clay On Wed, May 30, 2012 at 1:03 PM, Moe Jette je...@schedmd.com wrote: Oddly enough, I ran across this problem just yesterday on an old CentOS distro. No great solutions, but here are some options: * Upgrade the OS * Modify SLURM to spread out the job

[slurm-dev] Re: Is default for NodeAddr NodeName or NodeHostname?

2012-05-25 Thread Moe Jette
The code matches what you report below. The documentation is just out of date and will be corrected in the next release. Thanks. Quoting Bjørn-Helge Mevik b.h.me...@usit.uio.no: According to man slurm.conf, the default for NodeAddr is NodeName: By default, the NodeAddr will be

[slurm-dev] Re: how to get slurm to output a nodelist in order of decreasing %allocated?

2012-05-21 Thread Moe Jette
Chris, This is close to what you want: sinfo -N -S %C -o "%N %C" I think the sorting is based upon a string comparison, but could be easily altered to be based upon the percentage allocated. Quoting Chris Harwell super...@gmail.com: Hi, I must be missing it. What is the sinfo or squeue

[slurm-dev] Re: SLURM Amazon EC2 Integration

2012-05-15 Thread Moe Jette
Quoting Jharrod W. LaFon jharrod.la...@gmail.com: Hello, You may or may not be aware of a free utility called StarCluster ( http://web.mit.edu/star/cluster/docs/latest/index.html), which completely allocates and configures clusters on Amazon's EC2. It also provides the ability to grow or

[slurm-dev] Re: Minor bug in display of Partition Name in sinfo

2012-05-15 Thread Moe Jette
')) { to } else if (field[0] == 'E') { Don -Original Message- From: Moe Jette [mailto:je...@schedmd.com] Sent: Thursday, April 26, 2012 2:45 PM To: slurm-dev Subject: [slurm-dev] Re: Minor bug in display of Partition Name in sinfo I reverted the change in SLURM v2.3, but left it in SLURM

[slurm-dev] Re: Slurm not allocating jobs in node (IDLE STATE all time)

2012-05-11 Thread Moe Jette
-Original Message- From: Moe Jette [mailto:je...@schedmd.com] Sent: Thursday, May 10, 2012 10:28 PM To: slurm-dev Subject: [slurm-dev] Re: Slurm not allocating jobs in node (IDLE STATE all time) The squeue command should report the reason for the jobs not running. Quoting Tal

[slurm-dev] Re: Dependent Job Prioritization Bug

2012-05-09 Thread Moe Jette
This patch will be in SLURM v2.3.5 when released. Note this change would not be applicable to SLURM v2.4 (job priorities are not adjusted for non-runnable jobs). Thanks! Quoting Lipari, Don lipa...@llnl.gov: The symptom is that SLURM schedules lower priority jobs to run when higher

[slurm-dev] Re: Minor bug in display of Partition Name in sinfo

2012-04-25 Thread Moe Jette
This change will be in SLURM v2.3.5 and v2.4.0-pre5. Thanks! Quoting don.alb...@bull.com: There is a minor problem with the display of partition names in sinfo. Without options, the partition name field displays an asterisk * at the end of the name of the Default partition. If you specify

[slurm-dev] Re: Sharing GPU memory (gpu_mem)

2012-04-25 Thread Moe Jette
, Sergio Iserte Agut sise...@uji.es ha escrit: Thank you for you quick answer, I will get on with it! Regards! El 23/04/2012 17:28, Moe Jette je...@schedmd.com escribió: The current logic allows a GRES to be allocated to one job at a time, however you could develop a new plugin to do what you

[slurm-dev] Re: Problems with --enable-bgq-emulation and AllowSubBlockAllocations=Yes

2012-04-16 Thread Moe Jette
How about a backtrace of the core file? Quoting Mark Nelson mdnels...@gmail.com: Hi All, I'm trying to play with SLURM emulating a Blue Gene /Q but I find that whenever I set AllowSubBlockAllocations=Yes in my bluegene.conf it causes the SLURM control daemon to segfault when trying to
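
Something along these lines, assuming the default install path and a daemon built with debug symbols:

  $ gdb /usr/sbin/slurmctld /path/to/core
  (gdb) bt full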

[slurm-dev] Re: Job scheduler does not start jobs on an idle partition

2012-04-13 Thread Moe Jette
SLURM was originally designed to optimize resource allocations for large and long-running jobs rather than large numbers of short-lived jobs. Complicating the logic is that nodes can be configured in more than one partition and jobs can be submitted to more than one partition. Jobs in

[slurm-dev] RE: CPU detection failure when using CR_Core

2012-04-11 Thread Moe Jette
Uptime=34115 Tal -Original Message- From: Moe Jette [mailto:je...@schedmd.com] Sent: Saturday, April 07, 2012 3:27 AM To: slurm-dev; Tal Hazan Subject: Re: [slurm-dev] CPU detection failure when using CR_Core You probably have FastSchedule=1 or 2. Set to 0 if you want information

[slurm-dev] RE: CPU detection failure when using CR_Core

2012-04-11 Thread Moe Jette
have redhat 6.2 with X5690 cpu's. -Original Message- From: Moe Jette [mailto:je...@schedmd.com] Sent: Wednesday, April 11, 2012 8:20 PM To: slurm-dev Subject: [slurm-dev] RE: CPU detection failure when using CR_Core Exactly what operating system and processor does your system have

[slurm-dev] Re: Incorrect job rejection related to cons_res, topology and srun -N option

2012-03-28 Thread Moe Jette
The problem was conflicting logic in the select/cons_res plugin. Some of the code was trying to get the job the maximum node count in the range while other logic was trying to minimize spreading out of the job across multiple switches. As you note, this problem only happens when a range

[slurm-dev] Re: adding multi-cores to cluster

2012-03-26 Thread Moe Jette
All that you have to do is define the new nodes in slurm.conf (specify the hardware configuration: Sockets, CoresPerSocket, ThreadsPerCore, Memory, CPUs, etc.), add a new Partition line to the configuration, install SLURM on the new nodes, and restart the daemons to use the new slurm.conf
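
A sketch of the two slurm.conf additions (hardware values and names are placeholders):

  NodeName=new[01-04] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=65536
  PartitionName=newnodes Nodes=new[01-04] MaxTime=INFINITE State=UP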

[slurm-dev] Re: SLURM and patched mpich1

2012-03-26 Thread Moe Jette
SLURM's mpich1 plugin is designed to start one task per node. The patched mpich1 library starts a number of tasks on that node based upon SLURM environment variables, so that seems to be where the problem is. Please respond with details and/or a patch when you resolve this. Quoting Taras

[slurm-dev] SLURM versions 2.3.4 and 2.4.0-pre4 are now available

2012-03-19 Thread Moe Jette
SLURM versions 2.3.4 and 2.4.0-pre4 are now available from http://www.schedmd.com/#repos A description of the changes is appended. * Changes in SLURM 2.3.4 -- Set DEFAULT flag in partition structure when slurmctld reads the configuration file. Patch from Rémi

[slurm-dev] Re: --switch or --switches?

2012-03-16 Thread Moe Jette
Thanks Rod. This will be in the next release. Quoting rod.schu...@bull.com: Moe and Don, It looks like changes were made to the man pages. However, --switch is still used for the info, usage, and help strings. The attached patch fixes those. Rod From: Lipari, Don

[slurm-dev] Fw: Slurmctld segmentation fault after scontrol reconfigure

2012-03-01 Thread Moe Jette
Thanks for the analysis Martin. A failed job which completed normally could probably fall through the same logic resulting in a seg fault. I would recommend handling this failure by adding a check to build_feature_list() for the detail_ptr being null, which would address any job (FAILED
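
The suggested guard, as a sketch (the surrounding function lives in slurmctld; the exact signature and return codes may differ):

  extern int build_feature_list(struct job_record *job_ptr)
  {
      struct job_details *detail_ptr = job_ptr->details;
      if (detail_ptr == NULL)       /* e.g. a FAILED or completed job */
          return SLURM_SUCCESS;     /* nothing to parse */
      /* ... existing feature-parsing logic ... */
      return SLURM_SUCCESS;
  }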

[slurm-dev] Re: Slurm(dbd), DRBD and Pacemaker

2012-02-29 Thread Moe Jette
This patch should fix the problem with a control_machine containing more than one host name. https://github.com/SchedMD/slurm/commit/10916457cb7a23390c04b097ebb7d1a0e9971195.patch Without the patch, the configuration from Pär should also work. Quoting Pär Andersson pa...@nsc.liu.se: Matteo

[slurm-dev] Re: srun: error: Unable to create job step: Invalid generic resource (gres) specification

2012-02-29 Thread Moe Jette
You can find an introduction to SLURM (Quick Start User Guide) here: http://www.schedmd.com/slurmdocs/quickstart.html When srun is executed within an existing allocation (salloc or sbatch), it starts a job step that can only use resources already allocated to that job (no GPUs in the job
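
For example (GPU counts are illustrative):

  $ sbatch --gres=gpu:2 job.sh    # the job is allocated two GPUs
  # inside job.sh:
  $ srun --gres=gpu:1 app         # OK: within the job's allocation
  $ srun --gres=gpu:4 app         # fails: exceeds what the job holds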

[slurm-dev] Re: Network error _slurm_connect failed: Connection refused

2012-02-28 Thread Moe Jette
Your slurmctld daemon exited right away. Check your slurmctld log. Your partition is configured (twice) with nodes linux[0-11] that do not exist. Quoting DIAM code distribution DIAM/CDRH/FDA diamc...@gmail.com: I am trying to install and start SLURM on a single node with 12 cpus (just
