slurm version: 14.11.0-0rc2
Setting FirstJobId to a large number (six digits or more, i.e. >99999, judging
by the buffer size) crashes slurmctld. It appears the culprit is that the hash
directory was added to the path under the state save location without
increasing the job_dir char buffer. This has sharply reduced the maximum
usable job id.
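To make the size limit concrete, here is a small standalone sketch (not Slurm code) that measures how many bytes the formatted path needs for a given job id. The format string is the one that appears in the backtrace in this report; the helper name job_dir_bytes_needed is made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Format string as reported in frame #10 of the backtrace. */
#define JOB_ENV_FMT "/hash.%d/job.%u/environment"

/*
 * Hypothetical helper (not part of Slurm): number of bytes needed to
 * hold the formatted path for job_id, including the terminating NUL.
 */
static size_t job_dir_bytes_needed(uint32_t job_id)
{
    int hash = (int)(job_id % 10);  /* same hash as in get_job_env() */

    /* With a NULL buffer and size 0, snprintf() only reports the length. */
    int len = snprintf(NULL, 0, JOB_ENV_FMT, hash, (unsigned int)job_id);

    return (size_t)len + 1;         /* +1 for the terminating NUL */
}
```

By this count job id 99999 needs exactly 30 bytes, 100000 needs 31, and our first job id 10000003 needs 33, so anything past five digits no longer fits in job_dir[30].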
We ran into this while spinning up a brand new slurmctld running
v14.11 alongside our existing one running v2.6.3. For reporting
purposes we wanted to keep job ids unique, so we set FirstJobId=10000000,
which crashed slurmctld as soon as the first job ran.
Here's the relevant gdb backtrace:
Program received signal SIGABRT, Aborted.
0x00007ffff76305c9 in raise () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-14.11.0-0rc2.el7.centos.x86_64
(gdb) bt
#0 0x00007ffff76305c9 in raise () from /usr/lib64/libc.so.6
#1 0x00007ffff7631cd8 in abort () from /usr/lib64/libc.so.6
#2 0x00007ffff7670dd7 in __libc_message () from /usr/lib64/libc.so.6
#3 0x00007ffff77088f7 in __fortify_fail () from /usr/lib64/libc.so.6
#4 0x00007ffff7706ac0 in __chk_fail () from /usr/lib64/libc.so.6
#5 0x00007ffff7705fe9 in _IO_str_chk_overflow () from /usr/lib64/libc.so.6
#6 0x00007ffff76744ac in __GI__IO_default_xsputn () from /usr/lib64/libc.so.6
#7 0x00007ffff7642dfc in vfprintf () from /usr/lib64/libc.so.6
#8 0x00007ffff7706078 in __vsprintf_chk () from /usr/lib64/libc.so.6
#9 0x00007ffff7705fcd in __sprintf_chk () from /usr/lib64/libc.so.6
#10 0x0000000000441819 in sprintf (__fmt=0x5702b3 "/hash.%d/job.%u/environment", __s=0x7ffffffedf60 "/hash.3/job.10000003/environm") at /usr/include/bits/stdio2.h:33
#11 get_job_env (job_ptr=job_ptr@entry=0x83fba0, env_size=env_size@entry=0x8519e8) at job_mgr.c:6018
#12 0x0000000000456118 in build_launch_job_msg (job_ptr=job_ptr@entry=0x83fba0, protocol_version=<optimized out>) at job_scheduler.c:1628
#13 0x0000000000456af3 in launch_job (job_ptr=0x83fba0) at job_scheduler.c:1678
#14 0x000000000045a9c4 in schedule (job_limit=1200, job_limit@entry=0) at job_scheduler.c:1401
#15 0x000000000042b2ca in _slurmctld_background (no_data=0x0) at controller.c:1637
#16 main (argc=<optimized out>, argv=<optimized out>) at controller.c:565
Here's a patch that seemed to fix the issue:
diff -ru slurm-14.11.0-0rc2/src/slurmctld/job_mgr.c slurm-14.11.0-1rc2/src/slurmctld/job_mgr.c
--- slurm-14.11.0-0rc2/src/slurmctld/job_mgr.c 2014-10-17 16:44:21.000000000 -0400
+++ slurm-14.11.0-1rc2/src/slurmctld/job_mgr.c 2014-10-31 13:28:16.124297876 -0400
@@ -6010,7 +6010,7 @@
*/
char **get_job_env(struct job_record *job_ptr, uint32_t * env_size)
{
- char job_dir[30], *file_name, **environment = NULL;
+ char job_dir[35], *file_name, **environment = NULL;
int hash = job_ptr->job_id % 10;
int cc;
The size of the job_dir string minus the job id is 25 (24 fixed characters
plus the terminating NUL) and assuming a max job id of 4294901760 (10 chars
in length) we increased the buffer from 30 to 35.
Not sure what other parts of the code this affects.
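As an alternative to hand-counting, here is a rough sketch of how the buffer size could be derived from the format itself and the write bounded with snprintf(), so that an undersized buffer shows up as an error return instead of an abort. This is not the patch above and not how Slurm does it; the macro names and format_job_env_path are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Fixed part of "/hash.%d/job.%u/environment" with a one-digit hash.
 * sizeof on a string literal counts the terminating NUL, giving 25.
 */
#define JOB_DIR_FIXED  sizeof("/hash.N/job./environment")
/* Decimal digits in the largest uint32_t value (4294967295). */
#define JOB_ID_DIGITS  10
#define JOB_DIR_SIZE   (JOB_DIR_FIXED + JOB_ID_DIGITS)   /* 35 */

/*
 * Hypothetical helper (not part of Slurm): format the path into buf.
 * Returns 0 on success, -1 if the result would not fit in buf_size bytes.
 */
static int format_job_env_path(char *buf, size_t buf_size, uint32_t job_id)
{
    int hash = (int)(job_id % 10);
    int len = snprintf(buf, buf_size, "/hash.%d/job.%u/environment",
                       hash, (unsigned int)job_id);

    /* snprintf() returns the length it wanted to write; compare to detect truncation. */
    if (len < 0 || (size_t)len >= buf_size)
        return -1;
    return 0;
}
```

A caller would declare char job_dir[JOB_DIR_SIZE] and check the return value before using the path. With a 30-byte buffer the same call reports failure for job id 10000003 rather than overrunning the stack.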
Cheers,
--Andrew
Andrew E. Bruno
Senior Programmer Analyst
Center for Computational Research
SUNY at Buffalo
New York State Center of Excellence
in Bioinformatics and Life Sciences
701 Ellicott St
Buffalo, NY 14203
http://buffalo.edu/ccr