Ok, here is a patch (git diff) that adds "EnableSafeStatePartition" to the slurm.conf file as we discussed.
I would love some code review :)

Cheers,
Clay

On Thu, May 31, 2012 at 8:35 AM, Moe Jette <[email protected]> wrote:
>
> Quoting Clay Teeter <[email protected]>:
>
> > Sure, I'll volunteer. Comments inline.
> >
> > On Wed, May 30, 2012 at 3:03 PM, Moe Jette <[email protected]> wrote:
> >
> >> If you are volunteering, then sure.
> >> * Basing the subdirectory off the last digit or two of the job id
> >>   should be easiest
> >
> > For clarity, this is what is requested:
> >   /45/xxxx45
> >   /45/yyyyy45
> >   /46/xxxx46
> >   /46/yyyyy46
>
> Perfect. One digit would get you up to about 300k jobs, which SLURM
> would struggle to handle today (although that is changing). Two digits
> as shown above should be good for any workload that SLURM is likely to
> ever see.
>
> >> * Code needs to be added to create these new directories either on
> >>   demand or at slurmctld startup
> >> * I would suggest making the new logic conditional upon a SLURM
> >>   build-time option
> >
> > Why wouldn't this be a slurm.conf option? Seems easier and more
> > flexible than a build option.
>
> My concern was that someone changes the configuration back and forth.
> If the directory locations are reconciled at startup (both to and from
> the extra subdirectories), then making this a configuration option is
> good.
>
> >> * Existing directories need to be moved or their jobs will be killed
> >>   when slurmctld restarts using the new logic
> >
> > On restart, job directories would be reconciled.
> >
> >> Quoting Clay Teeter <[email protected]>:
> >>
> >> > It sounds like the second option (partition state on jobid or ...)
> >> > would be a great general solution. Would people here be interested
> >> > in a patch for this?
> >> >
> >> > Cheers,
> >> > Clay
> >> >
> >> > On Wed, May 30, 2012 at 1:03 PM, Moe Jette <[email protected]> wrote:
> >> >
> >> >> Oddly enough, I ran across this problem just yesterday on an old
> >> >> CentOS distro.
> >> >> No great solutions, but here are some options:
> >> >> * Upgrade the OS
> >> >> * Modify SLURM to spread out the job directories into
> >> >>   subdirectories, say using a subdirectory based upon the last
> >> >>   digit of the job ID. This applies to code in only a couple of
> >> >>   places, so it should be pretty simple (search for "/environment"
> >> >>   in src/slurmctld/job_mgr.c)
> >> >> * Configure MaxJobs=32000 in slurm.conf and force users to reduce
> >> >>   the load
> >> >> * The directories are created only for batch jobs, so if you can
> >> >>   run interactive jobs (srun/salloc) this limit would not apply
> >> >>
> >> >> Quoting Clay Teeter <[email protected]>:
> >> >>
> >> >> > Thanks for the quick response! Given that our system is ext3
> >> >> > using a 2.6 kernel, is there anything that we can do to
> >> >> > configure slurm not to create 32K directories/jobs in
> >> >> > /var/slurm/state/?
> >> >> >
> >> >> > Cheers,
> >> >> > Clay
> >> >> >
> >> >> > On Wed, May 30, 2012 at 10:56 AM, Moe Jette <[email protected]> wrote:
> >> >> >
> >> >> >> See:
> >> >> >> http://superuser.com/questions/298420/cannot-mkdir-too-many-links
> >> >> >>
> >> >> >> With Ubuntu 12.04 (Linux 3.2.0-24) the limit is at least 200k
> >> >> >> rather than 32k.
> >> >> >>
> >> >> >> Quoting Clay Teeter <[email protected]>:
> >> >> >>
> >> >> >> > Hi Group,
> >> >> >> >
> >> >> >> > Anyone know how I might troubleshoot this error message?
> >> >> >> >
> >> >> >> > [2012-05-15T19:34:27] _slurm_rpc_submit_batch_job: I/O error
> >> >> >> > writing script/environment to file
> >> >> >> > [2012-05-15T19:34:28] error: mkdir(/var/slurm/state/job.3258740)
> >> >> >> > error Too many links
> >> >> >> >
> >> >> >> > Cheers,
> >> >> >> > Clay
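For the record, the startup reconciliation discussed above (moving old flat job.N directories into the new subdirectories so running jobs survive a restart) could look roughly like this sketch; the function name and error handling are my own invention, not the patch code:

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Illustrative sketch (not SLURM code): on slurmctld startup, move any
 * flat job.<id> directories left by the old layout into the two-digit
 * subdirectories, so their jobs are not killed after the option is
 * enabled. The reverse pass (flattening) would mirror this logic. */
static void reconcile_state_dirs(const char *base)
{
	DIR *d = opendir(base);
	struct dirent *e;
	char sub[4096], old_path[4096], new_path[4096];
	unsigned long id;

	if (!d)
		return;
	while ((e = readdir(d))) {
		if (sscanf(e->d_name, "job.%lu", &id) != 1)
			continue;	/* ".", "..", or a hash subdir */
		snprintf(sub, sizeof(sub), "%s/%02lu", base, id % 100);
		(void) mkdir(sub, 0700);	/* ok if it already exists */
		snprintf(old_path, sizeof(old_path), "%s/%s", base, e->d_name);
		snprintf(new_path, sizeof(new_path), "%s/job.%lu", sub, id);
		if (rename(old_path, new_path))
			perror(new_path);
	}
	closedir(d);
}
```

rename() is atomic within a filesystem, so a crash mid-reconciliation leaves each job directory either in the old or the new location, never lost.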
enable_state_save_partition.diff
Description: Binary data
