[slurm-dev] Re: Where are the RPM's?

2013-01-16 Thread David Bigagli
Hi there, perhaps the best way is to build the RPMs yourself from the source code. Please follow the instructions on this page: http://www.schedmd.com/slurmdocs/quickstart_admin.html David On 01/16/2013 09:51 PM, Richard Casey wrote: Hi, On the webpage

[slurm-dev] Re: slurm and ansys mechnical

2013-01-17 Thread David Bigagli
Hi, do the spawned processes also change process group? The process tracker should track all processes sharing the same process group ID. Isn't there an option in ansys to tell it to wait for all its children? Something like: #!/bin/sh sleep 3600 & wait David On 01/17/2013 02:30 AM, Sten Wolf

[slurm-dev] Re: When is database created?

2013-01-17 Thread David Bigagli
Hi, you have to restart slurmctld for the database to be created. slurmdbd must be running before you start slurmctld. David On 01/17/2013 06:55 PM, Richard Casey wrote: Hi, We installed slurm 2.5.1 with rpmbuild -ta slurm*.tar.bz2 rpm --install the rpm files All the

[slurm-dev] Re: Uncleared events

2013-01-22 Thread David Bigagli
Damien, None of your nodes in your output are currently down. Anything with an endtime is something that has been accounted for. If there are events without an endtime (Outside of actual downed nodes or Cluster processor count lines) then those are the ones to look at. It is interesting

[slurm-dev] Re: LSF command wrappers for Slurm?

2013-01-23 Thread David Bigagli
Alessandro, if your jobs leave output files on the execution hosts you can copy them back to the submission host or any other host using the task epilog which runs after each job step. - #!/bin/sh #SBATCH --partition=compute #SBATCH --nodes=2 sbcast -f shello /tmp/shello

[slurm-dev] RE: not executing script(?)

2013-01-25 Thread David Bigagli
I think the idea is that given a script like this one: - cat myenv #!/bin/sh hostname ulimit -a env|sort echo done: `date` - run it as: ssh myhost myenv > LOG.ssh and as srun -p mypartition -w myhost myenv > LOG.srun then

[slurm-dev] Re: slurmctld crash, no reason

2013-02-01 Thread David Bigagli
Hi, upon startup slurmctld changes its working directory to where the log file is. If the log file is: SlurmctldLogFile=/var/tmp/slurm/slurmctld.log the working directory is /var/tmp/slurm. Assuming your slurmctld core dumps for whatever reason, the core file should be there. The directory should be
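
A minimal sketch of the rule described in this message, assuming only what it states: the working directory (and so the place to look for a core file) is the directory part of SlurmctldLogFile. The helper name and the sample configuration text are hypothetical, and this is not how slurmctld itself reads its configuration.

```python
import os


def expected_core_dir(slurm_conf_text):
    """Return the directory of SlurmctldLogFile, or None if it is not set."""
    for raw_line in slurm_conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Compared case-insensitively here; treat that as an assumption.
        if key.strip().lower() == "slurmctldlogfile":
            return os.path.dirname(value.strip())
    return None


# Hypothetical slurm.conf fragment using the path from the message.
sample_conf = """
# logging
SlurmctldLogFile=/var/tmp/slurm/slurmctld.log
"""
core_dir = expected_core_dir(sample_conf)
print(core_dir)
```

With the path quoted in the message this yields /var/tmp/slurm, the directory in which to look for the core file.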

[slurm-dev] Re: Jobs are queued, but still resources left.

2013-02-06 Thread David Bigagli
By default Slurm allocates nodes in exclusive mode. You have to use Consumable Resources to achieve what you want. http://schedmd.com/slurmdocs/cons_res.html /David On Wed, Feb 6, 2013 at 10:47 AM, Niels Rothermel niels.rother...@googlemail.com wrote: Hello, I think my problem is pretty

[slurm-dev] Re: job cancelled due to node failure

2013-02-06 Thread David Bigagli
I think you have done the steps correctly. What was the error that happened? /David On Wed, Feb 6, 2013 at 12:50 PM, Mario Kadastik mario.kadas...@cern.ch wrote: Hi, today I was adding a few nodes to slurm so I added them to slurm.conf nodes definition and then restarted slurm controller.

[slurm-dev] Re: Some nodes are not responding..

2013-02-20 Thread David Bigagli
not responding [2013-02-19T17:42:20] error: Nodes lufer121 not responding [2013-02-19T17:43:59] error: Nodes lufer121 not responding, setting DOWN On 02/19/2013 04:50 PM, David Bigagli wrote: Hi, have a look at the slurmd.log on the non-responding hosts. Does it show any relevant information

[slurm-dev] Re: Some nodes are not responding..

2013-02-21 Thread David Bigagli
they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf. /David On Thu, Feb 21, 2013 at 10:16 AM, Marcin Stolarek stolarek.mar...@gmail.com wrote: 2013/2/20 David Bigagli da...@schedmd.com Hi, do you share your slurm.conf or each node has its own? All

[slurm-dev] Re: Help gaining result files from SLURM

2013-02-24 Thread David Bigagli
Hi, are those files created by a job? In such case you can use the job epilog to copy the data when the job is done. http://www.schedmd.com/slurmdocs/prolog_epilog.html /David On Sun, Feb 24, 2013 at 6:58 PM, Giulio V. de Musso g...@libero.it wrote: Hi I've setup a SLURM cluster made up

[slurm-dev] Re: automatically resuming a node?

2013-03-01 Thread David Bigagli
Hi, have a look at ReturnToService in 'man slurm.conf'. /David On Fri, Mar 1, 2013 at 9:12 PM, Tim caphrim...@gmail.com wrote: Hi folks, I'm having some concern with some of my slurm nodes in that the slightest thing seems to make them go into a down state. Low CPUs, node unexpectedly

[slurm-dev] Re: Get execution time for each job

2013-03-04 Thread David Bigagli
Have a look at 'man sacct'. */David* On Mon, Mar 4, 2013 at 8:18 PM, Giulio V. de Musso g...@libero.it wrote: Hi I run some sbatches on my SLURM cluster. I need to know the execution time for each job. So if for example the job ID is 34 SLURM should create a file named 34.time in which

[slurm-dev] Re: slurmctld in version 2.5.3 is segfaulting in communication with slurmdbd via munge

2013-03-06 Thread David Bigagli
Have you updated the slurmdb daemon first as described here: http://schedmd.com/slurmdocs/quickstart_admin.html */David* On Wed, Mar 6, 2013 at 5:58 PM, Lennart Karlsson lennart.karls...@it.uu.se wrote: Hi, Today I upgraded SLURM from v 2.4.3 to v 2.5.3. It seems like a mistake, because

[slurm-dev] Re: change select() to poll() in src/common/fd.c

2013-03-12 Thread David Bigagli
This is the way select() works regardless of the version of redhat or any other distribution. The fd_set is a bit array defined in sys/select.h of __FD_SETSIZE which is defined as 1024 in bits/typesizes.h */David* On Tue, Mar 12, 2013 at 11:30 AM, Hongjia Cao hj...@nudt.edu.cn wrote: When
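
To illustrate the point of this thread: select() works on a fixed-size fd_set bit array, while poll() takes an explicit list of descriptors and so is not bound by that array size. The sketch below uses Python's standard select.poll on a pipe; it is only an illustration of the poll() style of waiting, not the code in src/common/fd.c, and the helper name is hypothetical.

```python
import os
import select


def wait_readable(fd, timeout_ms):
    """Return True if fd becomes readable within timeout_ms milliseconds."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # poll() returns a list of (fd, event_mask) pairs for ready descriptors.
    events = poller.poll(timeout_ms)
    return any(ready_fd == fd and bool(mask & select.POLLIN)
               for ready_fd, mask in events)


read_end, write_end = os.pipe()
ready_before = wait_readable(read_end, 0)    # nothing written yet
os.write(write_end, b"x")
ready_after = wait_readable(read_end, 1000)  # one byte is waiting
os.close(read_end)
os.close(write_end)
print(ready_before, ready_after)
```

The descriptor is passed by number in a registration call, so its value does not have to fit inside a bit array of a fixed length.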

[slurm-dev] Re: Memory swapping, and transition delay issues.

2013-03-12 Thread David Bigagli
Hi, the problem of memory over-subscription is discussed in 'man slurm.conf'. Have a look at DefMemPerCPU, DefMemPerNode and the suggested configuration when using CR_CPU_Memory. */David* On Tue, Mar 12, 2013 at 3:15 PM, Joo-Kyung Kim supa...@gmail.com wrote: Hi, I am using

[slurm-dev] Re: node switching / selection

2013-03-22 Thread David Bigagli
Is it possible the job runs on several nodes, say -N 3, then one node is lost so it ends up running on 2 nodes only? Such a job should have been submitted with --no-kill. /David On Fri, Mar 22, 2013 at 4:06 PM, Michael Colonno mcolo...@stanford.edu wrote: Actually did mean node below.

[slurm-dev] Re: slurmdbd dies

2013-03-28 Thread David Bigagli
Try to start it so it does not daemonize. I think the option is -D but better check the man page. See if it core dumps. sent from galaxy nexus On Mar 27, 2013 9:31 AM, Pablo Sanz Mercado pablo.s...@uam.es wrote: Hi Alejandro, Sorry, the messages we obtain about the couldn't

[slurm-dev] Re: Easy Backfilling Plugin for SLURM

2013-04-25 Thread David Bigagli
The available slurm documentation can be found here: http://slurm.schedmd.com */David* On Thu, Apr 25, 2013 at 11:41 AM, David Bigagli da...@schedmd.com wrote: Hi, slurm.conf has the following parameter as documented in the slurm.conf man page: max_job_bf

[slurm-dev] Re: Slurmctld multithreaded?

2013-06-12 Thread David Bigagli
Hi, the gstack command will show you the activities of each thread in the slurmctld process. This is an example: david@prometeo ~gstack 14432 Thread 8 (Thread 0x7fa9c9190700 (LWP 14433)): #0 0x0035b90acb8d in nanosleep () from /lib64/libc.so.6 #1 0x0035b90aca00 in sleep () from

[slurm-dev] Re: slurm integration with FlexLM license manager

2013-07-02 Thread David Bigagli
Indeed currently there is no integration between Flexlm and SLURM, but some ideas are being passed around what to do about it. I am one of the original designers and developers of Platform License Scheduler. The item 1) you mentioned is certainly the first step but consider even that may not be

[slurm-dev] Re: slurm integration with FlexLM license manager

2013-07-02 Thread David Bigagli
of anything different, I, for one, would be happy to know. Gary D. Brown On Tue, Jul 2, 2013 at 8:38 AM, David Bigagli da...@schedmd.com wrote: Indeed currently there is no integration between Flexlm and SLURM, but some ideas are being passed around what to do about it. I am one of the original

[slurm-dev] Re: Understanding PMI2 support in SLURM 2.6.0

2013-07-25 Thread David Bigagli
Hello, it is a requirement to specify --mpi=pmi2 otherwise the srun will not load the pmi2 library implementing the server side pmi2 functionalities. There was an error in the contribs/pmi2/pmi2_api.c causing the 'no value for req' message, this was the -1488 remaining_len -=

[slurm-dev] Re: Overtime job exit code

2013-09-10 Thread David Bigagli
Hi, this issue has been fixed in the 2.6.2 release. On 09/10/2013 09:01 AM, Michael Gutteridge wrote: We allow jobs to overrun their wall time via OverTimeLimit. We've noticed that jobs that complete successfully but go over the wall time are reported as having JobState=TIMEOUT in the job

[slurm-dev] Re: Bug in Slurm time=days-hours:minutes parsing?

2013-10-02 Thread David Bigagli
Sounds good. :-) Thanks for the patch it is going to be in Slurm 2.6.3. On 10/02/2013 12:28 AM, Mark Nelson wrote: Hi All, It does look like there is a bug in time_str2secs(): If we give it a time of format: days-0:min, we exit the for loop with days set, but with min set to our hours value
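
For reference, a small sketch of how the days-hours:minutes family of time strings maps to seconds. It follows the formats listed in the sbatch documentation for --time (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds). This is an independent illustration with a hypothetical function name, not the time_str2secs() code or the patch discussed in the thread.

```python
def time_to_seconds(spec):
    """Convert a Slurm-style time string to seconds (illustrative only)."""
    days = 0
    if "-" in spec:
        # "days-hours[:minutes[:seconds]]": fields after the dash start at hours.
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if not 1 <= len(fields) <= 3:
            raise ValueError("bad time string: %r" % spec)
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        # Without a dash: "minutes", "minutes:seconds" or "hours:minutes:seconds".
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError("bad time string: %r" % spec)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# The case from the bug report: one day, zero hours, thirty minutes.
print(time_to_seconds("1-0:30"))
```

In the days-0:min case the value after the colon has to land in the minutes field, which is the behaviour the bug report is about.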

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread David Bigagli
Hi, I don't know the details of the segfault but the code in question is correct. If you decrease the length then the file cmdlen: cmdlen = PMII_MAX_COMMAND_LEN - remaining_len; will not be correct and wrong length will be sent to the pmi2 server. This code is taken verbatim from

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread David Bigagli
to reflect the reduced size of the c buffer. On Oct 3, 2013, at 9:44 AM, David Bigagli da...@schedmd.com wrote: Hi, I don't know the details of the segfault but the code in question is correct. If you decrease the length then the file cmdlen: cmdlen = PMII_MAX_COMMAND_LEN - remaining_len

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread David Bigagli
Fixed. I chose the method proposed by Michael, subtract first, add later. :-) On 10/03/2013 10:29 AM, Ralph Castain wrote: On Oct 3, 2013, at 10:16 AM, David Bigagli da...@schedmd.com wrote: I am not saying that remaining_len is correct or that mpich is bugless :-) I am only saying

[slurm-dev] Re: Interactive Jobs Not Launching Under High Load

2013-10-25 Thread David Bigagli
, /David/Bigagli www.schedmd.com voice: +1 415 320 2776

[slurm-dev] Re: Patch to contribs/torque/pbsnodes.pl to work more like TORQUE pbsnodes command

2013-11-12 Thread David Bigagli
offline (-o), reset (-r), clear (-c), and -N (set note/reason) command line options * adds the -l (brief list) and -n (list with notes) command line options * format the output in the default verbose list mode more like TORQUE's pbsnodes does --Troy -- Thanks, /David/Bigagli

[slurm-dev] Re: slurm_terminate_job versus slurm_kill_job

2014-01-29 Thread David Bigagli
is getting the job done. Is there a reason that slurm_kill_job shouldn't be used? Thanks Michael -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: performance improvement for hostlist

2014-03-20 Thread David Bigagli
hostlist_push with hostlist_push_host (commit 1b0b135f9579e253ddd5bf680d2ea70ad12f9bda) fixes the problem of sinfo, but I think the root cause is in xmalloc. -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: performance improvement for hostlist

2014-03-20 Thread David Bigagli
. The multithread patch (commit 17449c066af69441b741110ef51fc2f534272871) does not help. Replacing hostlist_push with hostlist_push_host (commit 1b0b135f9579e253ddd5bf680d2ea70ad12f9bda) fixes the problem of sinfo, but I think the root cause is in xmalloc. -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: performance improvement for hostlist

2014-03-21 Thread David Bigagli
; for (i = 0; i < COUNT; i++) { ptr = malloc(MAX_RANGES * sizeof(struct _range)); free(ptr); } END_TIMER; info("malloc usecs: %lld", delta_t); return 0; } On Thu, 2014-03-20 at 10:42 -0700, David Bigagli wrote: Hi, do you have any data

[slurm-dev] Re: Gres GPU Problem with new slurm cluster

2014-03-31 Thread David Bigagli
, -J -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Patch for squeue submit time

2014-04-02 Thread David Bigagli
output. There are very few format chars left so I just picked a free one. Thanks Martins -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Website correction

2014-04-11 Thread David Bigagli
numbered cores on a node only. LICENSE_ONLY -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: OpenMPI PMI2 with 14.03 not working

2014-04-11 Thread David Bigagli
: aborting, io error with slurmstepd on node 0 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. srun: error: Timed out waiting for job step to complete Launching with salloc/sbatch works. - Anthony -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: OpenMPI PMI2 with 14.03 not working

2014-04-11 Thread David Bigagli
Errata corrige. The core file is in the log directory. On 04/11/2014 12:08 PM, David Bigagli wrote: Hi, this Slurm bug has been fixed and it will be available in 14.03.1 which will be released soon. Otherwise it is available in the HEAD. You should find a core file of slurmstepd

[slurm-dev] Re: Fwd: slurm installation

2014-04-25 Thread David Bigagli
StorageType=accounting_storage/mysql #StorageHost=localhost #StoragePort=1234 StoragePass=slurm_pass StorageUser=slurm StorageLoc=slurm_acct_db -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: srun interactive failure after upgrade

2014-05-20 Thread David Bigagli
on its own and not within an salloc, is no longer supported and expected to fail? Thanks Martins On 5/20/14 1:23 PM, David Bigagli wrote: In 14.03 you should use the SallocDefaultCommand as documented in http://slurm.schedmd.com/slurm.conf.html to srun with the --pty option. On 05/19/2014 10

[slurm-dev] Re: Segfault at slutmctld-14.03.3-2 start on CentOS 5.10

2014-05-28 Thread David Bigagli
} dir_name = <value optimized out> Remembering some previous problems I suspect that some uninitialised variable in some structure (which represents some omitted option in slurmd.conf) may cause such effect. Could someone please give me some hints? Thanks! -- Thanks, /David

[slurm-dev] Re: Interactive job array

2014-07-17 Thread David Bigagli
you do to run an interactive job array ? By interactive, I mean that the command only exit at the end of the array ? Regards, Julien -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: How to interpret sacct output

2014-08-05 Thread David Bigagli
showing that he effectively used 162 cpu seconds (81 x 2). Thank you, Robert On 8/5/2014 9:53 AM, David Bigagli wrote: Hi Robert, the first line is the allocation and the second the batch step, the batch step runs on one cpu. I am not sure what are the columns with values 162 and 81

[slurm-dev] Re: How to interpret sacct output

2014-08-11 Thread David Bigagli
of the batch step higher than either of the two job steps? Thank you, Robert On 8/5/2014 1:41 PM, David Bigagli wrote: Yes that is correct. The first entry is the allocation which has 2 cpus, -n 2 was specified, the second entry is the batch step that ran for 81 seconds, so the total cpu time used
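
The arithmetic in this exchange, written out as a tiny sketch: an allocation of 2 cpus held for 81 seconds accounts for the 162 cpu seconds (81 x 2) quoted in the thread. The function name is hypothetical and this is not a statement about how sacct computes every column.

```python
def allocated_cpu_seconds(allocated_cpus, elapsed_seconds):
    """CPU time charged to an allocation: cpus held times wall-clock seconds."""
    return allocated_cpus * elapsed_seconds


# Values from the thread: -n 2 was specified and the batch step ran 81 seconds.
total = allocated_cpu_seconds(2, 81)
print(total)
```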

[slurm-dev] Re: cgroup freezer throwing Device or resource busy upon job cancel or kill - 14.03.6

2014-08-13 Thread David Bigagli
Slurm version and same kernel. Cheers, -- Thanks, /David/Bigagli Slurm User Group Meeting September 23-24, Lugano, Switzerland Find out more http://slurm.schedmd.com/slurm_ug_agenda.html www.schedmd.com

[slurm-dev] Re: cgroup freezer throwing Device or resource busy upon job cancel or kill - 14.03.6

2014-08-13 Thread David Bigagli
Interesting indeed. Let me have a look at it and experiment with it a bit. On 08/13/2014 04:16 PM, Kilian Cavalotti wrote: On Wed, Aug 13, 2014 at 10:00 AM, David Bigagli da...@schedmd.com wrote: For some reason at the first attempt rmdir(2) returns EBUSY. Would writing

[slurm-dev] Re: cgroup freezer throwing Device or resource busy upon job cancel or kill - 14.03.6

2014-08-18 Thread David Bigagli
Unfortunately the article refers to the memory sub system which gets removed without problem. The issue happens on the freezer, however it is just an error message without consequences. On 08/13/2014 04:16 PM, Kilian Cavalotti wrote: On Wed, Aug 13, 2014 at 10:00 AM, David Bigagli da

[slurm-dev] Re: sacctmgr

2014-09-16 Thread David Bigagli
-- Thanks, /David/Bigagli Slurm User Group Meeting September 23-24, Lugano, Switzerland Find out more http://slurm.schedmd.com/slurm_ug_agenda.html www.schedmd.com

[slurm-dev] Re: jobfilter plugin

2014-09-16 Thread David Bigagli
. The slurm.conf line reads: JobSubmitPlugins=default in compliance with the documentation. Thanks for your help Eva -- Thanks, /David/Bigagli Slurm User Group Meeting September 23-24, Lugano, Switzerland Find out more http://slurm.schedmd.com/slurm_ug_agenda.html www.schedmd.com

[slurm-dev] Re: Bug (?) and Bugfix in update reservation

2014-09-22 Thread David Bigagli
Köln, Von-der-Wettern-Strasse 27 Telefon: +49 (0) 2203 305-0 Telefax: +49 (0) 2203 305-1699 http://www.bull.de Bull, an Atos company ** Folgen Sie uns auf Twitter: http://twitter.com/bull_de ** Bull Firmenprofil bei XING: https://www.xing.com/companies/bullgmbh -- Thanks, /David/Bigagli

[slurm-dev] Re: jobfilter plugin

2014-09-24 Thread David Bigagli
[2014-09-24T13:47:52.151] error: cannot find job_submit plugin for job_submit/defaults [2014-09-24T13:47:52.151] error: cannot create job_submit context for job_submit/defaults [2014-09-24T13:47:52.151] fatal: failed to initialize job_submit plugin On Tue, 16 Sep 2014, David Bigagli wrote

[slurm-dev] Re: Determining hostname of srun given job_id/step_id

2014-10-10 Thread David Bigagli
process resides given a known job_id/step_id? Thanks, Andrew -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Determining hostname of srun given job_id/step_id

2014-10-10 Thread David Bigagli
I think the question was about the submission node, the node where the srun/sbatch was executed from. On 10/10/2014 04:14 PM, Franco Broi wrote: Are we talking about alloc_node? You can retrieve it using the perl api. On 11 Oct 2014 06:53, David Bigagli da...@schedmd.com wrote: Hi

[slurm-dev] Re: Documentation mismatch: man pages / html

2014-10-20 Thread David Bigagli
, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Database incompletely configured

2014-10-31 Thread David Bigagli
of the database? I have been grepping through the source tree, but I haven't stumbled on the script that creates the tables and columns needed. ~Charles~ -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Database incompletely configured

2014-10-31 Thread David Bigagli
: On 10/31/2014 12:49 PM, David Bigagli wrote: The database is created by the slurmdbd daemon. Have you granted access to the database to the slurm user? Yes, I did a grant all on slurm_acct_db.* TO 'slurm'@'localhost'; -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path

2014-11-06 Thread David Bigagli
srun directly as we get poor scaling, the next thing in the list (after SC14) is to migrate to Open-MPI 1.8.4 which is due out shortly which should address this. cheers, Chris -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path

2014-11-07 Thread David Bigagli
I agree. Done in commit c8f34560c87cfbbf. On 11/06/2014 06:46 PM, Christopher Samuel wrote: On 07/11/14 11:53, David Bigagli wrote: Hi, Hiya David, it used to be logged at debug level in 2.6 and now it is an error. This seems to be an issue with cgroups which does not allow that path

[slurm-dev] Re: Programmatically submit a job

2014-11-07 Thread David Bigagli
() is a stringly typed interface, and it would be nice if I could use an interface with stronger types. Cheers, Walter Landry -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: _slurm_cgroup_destroy message?

2014-11-18 Thread David Bigagli
, Department for Research Computing, University of Oslo -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: QOS reporting question

2014-12-18 Thread David Bigagli
what amount of time within a QOS. sacct can give me information on an account level but I can't seem to get it to report on a QOS level on a user by user basis. Thanks Jackie -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Two patches for jobacct_gather.

2015-02-06 Thread David Bigagli
for us that have working cgroups memory limits. Best regards, Magnus Jonsson -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Small bug in scontrol output

2015-01-28 Thread David Bigagli
) Shared=0 Contiguous=0 Licenses=(null) Network=(null) Command=./test.sh WorkDir=/home/adm17 StdErr=/home/adm17/test-e%j.txt - here %j is not expanded StdIn=/dev/null StdOut=/home/adm17/test-o12032.txt Regards, Uwe -- Thanks, /David/Bigagli

[slurm-dev] Re: possible bug in srun --unbuffered option

2015-03-09 Thread David Bigagli
Yes it works with or without --unbuffered. I don't think data are buffered inside of Slurm. On 03/09/2015 10:15 AM, Lipari, Don wrote: -Original Message- From: David Bigagli [mailto:da...@schedmd.com] Sent: Thursday, March 05, 2015 10:49 AM To: slurm-dev Subject: [slurm-dev] Re

[slurm-dev] Re: successful systemd service start on RHEL7?

2015-03-24 Thread David Bigagli
unloaded. Mar 24 22:22:46 cnlnx03 systemd[1]: Failed to start Slurm controller daemon. -- Subject: Unit slurmctld.service has failed -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: successful systemd service start on RHEL7?

2015-03-25 Thread David Bigagli
The slurm.spec file decides whether to install the init.d scripts or the systemd stuff. On 03/24/2015 07:24 PM, Fred Liu wrote: -Original Message- From: David Bigagli [mailto:da...@schedmd.com] Sent: Wednesday, March 25, 2015 1:19 To: slurm-dev Subject: [slurm-dev] Re: successful systemd

[slurm-dev] Re: Problems running job

2015-03-30 Thread David Bigagli
people have reported strange interactions between Slurm being on an NFSv4 mount (NFSv3 is fine). Good luck! Chris -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli
by a comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon- -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli
trying to figure out why it sent them into a held state as opposed to just simply requeueing as normal. Thoughts? -Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values for these parameters, you have to configure your own. In your case does the prolog fail or the node

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli
where it couldn't resolve user id's. So right after the job tried to launch it failed and requeued. We just let the scheduler do what it will when it lists Node_fail. -Paul Edmon- On 03/03/2015 01:20 PM, David Bigagli wrote: How do you set your node down? If I run a job and then issue

[slurm-dev] Re: Two problems with 14.11.4

2015-03-04 Thread David Bigagli
fine). Problem number 2: 'scontrol show jobs' shows jobs in state RUNNING that don't actually appear to exist. Some of these are days old. What might be going on here? -- Jon Nelson Dyn / Senior Software Engineer p. +1 (603) 263-8029 -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread David Bigagli
of these issues. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS NBE +358503841576 || janne.blomqv...@aalto.fi -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm and PMI2

2015-04-27 Thread David Bigagli
but to search for the library name looks like a reasonable start to me. I would hope you can help me with this. Thanks, Ulf -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm and PMI2

2015-04-27 Thread David Bigagli
for the library name looks like a reasonable start to me. I would hope you can help me with this. Thanks, Ulf -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Inadvertent access to all qos'

2015-05-14 Thread David Bigagli
, if I go and create a new qos, all users instantly can utilize this qos. This is very strange, I wonder if some setting has been munged in the database somewhere mistakenly? Any ideas? Thanks. Best, Chris -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: What is SPANK logging function slurm_debug?

2015-04-17 Thread David Bigagli
, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Question about prologging

2015-04-14 Thread David Bigagli
the file actually being there. I've even set extended ACL's on the directory so that the SlurmUser can see all of the files (sudo -u slurm ls -lR failed with permission denied). Could anyone tell me why the slurm_script file cannot be read via prolog? Thank you! John DeSantis -- Thanks, /David

[slurm-dev] Re: Segfault when using accounting_storage/filetxt in 15.08.0-0pre4

2015-06-04 Thread David Bigagli
with any of the changes I made. Is this a known problem with some kind of workaround, or should I file a bug on it? Eric -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: slurm questions - limit paging?

2015-08-12 Thread David Bigagli
Is there a way to limit paging outside of Slurm? There are memory limits in Slurm but no paging limit. There is a backup controller in Slurm, you can read about it here: http://slurm.schedmd.com/slurm.conf.html Thanks /David/Bigagli da...@schedmd.com

[slurm-dev] Re: module environment

2015-06-29 Thread David Bigagli
module is not a command in /bin. To make it work you have to source the module startup file in your script, for example: . /usr/local/Modules/3.2.10/init/bash then you can use the module file. On Jun 29, 2015, at 4:43 PM, Antonia Mey antonia@gmail.com wrote: Dear all, I may have a

[slurm-dev] Re: race condition in stepd i/o shutdown

2015-07-27 Thread David Bigagli
I think that Hydra should keep 0,1,2 open and dup them to /dev/null so that any children’s file descriptor will be greater than 2. This is the standard Unix way. Thanks /David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16
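
A minimal sketch of the convention suggested here, in Python rather than in Hydra's own code: make sure descriptors 0, 1 and 2 are occupied (pointing any closed one at /dev/null) so that every descriptor opened afterwards is numbered above 2. The function name is hypothetical, and only descriptors that are actually closed are touched.

```python
import os


def ensure_std_fds_open():
    """Point any closed descriptor among 0, 1, 2 at /dev/null.

    Returns the list of descriptors that had to be reopened.
    """
    reopened = []
    for fd in (0, 1, 2):
        try:
            os.fstat(fd)  # succeeds if the descriptor is open
        except OSError:
            null_fd = os.open(os.devnull, os.O_RDWR)
            if null_fd != fd:
                # Move the /dev/null descriptor onto the missing slot.
                os.dup2(null_fd, fd)
                os.close(null_fd)
            reopened.append(fd)
    return reopened


ensure_std_fds_open()
# With 0, 1 and 2 occupied, new descriptors are allocated above 2.
new_fds = os.pipe()
print(new_fds)
os.close(new_fds[0])
os.close(new_fds[1])
```

Because the kernel hands out the lowest free descriptor number, a process that starts with 0, 1 or 2 closed can otherwise end up with a pipe or socket sitting where stdin, stdout or stderr is expected.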

[slurm-dev] Re: fix of segfault in srun

2015-10-07 Thread David Bigagli
Fixed in 15.08.2. commit 30a5d6778fc86f8799cefc4fbea4f9ae7eac8d92 Author: Hongjia Cao <hj...@nudt.edu.cn> Date: Wed Oct 7 15:05:24 2015 +0200 Thanks for your contribution. On 10/07/2015 12:15 PM, Hongjia Cao wrote: attached. -- Thanks, /David/Bigagli da...@schedmd.com

[slurm-dev] Re: PMI2 in Slurm 14.11.8 ?

2015-09-07 Thread David Bigagli
Those pmi2 files are the server side of the pmi2 protocol implemented in slurmstepd; those are always installed. It is the client side that gets installed from the contribs directory. Thanks /David/Bigagli da...@schedmd.com

[slurm-dev] Re: scontrol show conf crashes slurmctld 14.11.9

2015-09-08 Thread David Bigagli
Can you show us the stack using gdb ? Thanks /David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16 September 2015, Washington D.C. http://slurm.schedmd.com/slurm_ug_agenda.html > On 08 Sep 2015, at 16:45, Mar

[slurm-dev] Re: Limiting count of array tasks started during backfill

2015-09-07 Thread David Bigagli
y to 4. The minimum index value is 0. The maximum value is one less than the configuration parameter MaxArraySize. Thanks /David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16 September 2015,
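
The index rule quoted in this message can be written as a one-line check: a job array index is valid when it lies between 0 and MaxArraySize minus one. The function name and the MaxArraySize value used below are illustrative assumptions; the real limit comes from the site's slurm.conf.

```python
def valid_array_index(index, max_array_size):
    """True if index is within 0 .. max_array_size - 1, per the quoted rule."""
    return 0 <= index <= max_array_size - 1


# Hypothetical MaxArraySize, used only to exercise the boundary cases.
MAX_ARRAY_SIZE = 1001
print(valid_array_index(0, MAX_ARRAY_SIZE),
      valid_array_index(1000, MAX_ARRAY_SIZE),
      valid_array_index(1001, MAX_ARRAY_SIZE))
```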