Ok, here is a patch (git diff) that adds "EnableSafeStatePartition" to the
slurm.conf file as we discussed.
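For reviewers who don't want to apply the diff first, the path-building side of the patch boils down to something like this (a sketch only; the helper name is made up, and the real change sits around the "/environment" handling in src/slurmctld/job_mgr.c):

```c
#include <stdio.h>
#include <stdint.h>

/* Build the state-save path for a job under the new layout: a
 * subdirectory named after the last two digits of the job ID, so
 * job 3258745 lands in <state_dir>/45/job.3258745.  The function
 * name here is hypothetical, not the one in the patch. */
void build_job_state_path(char *buf, size_t buf_len,
			  const char *state_dir, uint32_t job_id)
{
	snprintf(buf, buf_len, "%s/%02u/job.%u",
		 state_dir, job_id % 100, job_id);
}
```

So job 3258745 maps to /var/slurm/state/45/job.3258745, and the 100 fixed subdirectories bound the per-directory link count no matter how high the job IDs climb.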

I would love some code review :)
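The startup reconciliation discussed below (moving any pre-existing flat job.<id> directories into the new layout when slurmctld restarts) is, in spirit, the following pass; again just a sketch with made-up names, and a complete version would also need the reverse direction for when the option is switched off:

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical reconciliation pass: scan the state directory for flat
 * job.<id> entries and rename each into its two-digit subdirectory,
 * creating that subdirectory on demand. */
int reconcile_state_dir(const char *state_dir)
{
	DIR *dp = opendir(state_dir);
	struct dirent *de;
	char old_path[512], new_path[512];

	if (!dp)
		return -1;
	while ((de = readdir(dp))) {
		unsigned int job_id;
		if (sscanf(de->d_name, "job.%u", &job_id) != 1)
			continue;		/* not a flat job directory */
		snprintf(new_path, sizeof(new_path), "%s/%02u",
			 state_dir, job_id % 100);
		(void) mkdir(new_path, 0700);	/* ok if it already exists */
		snprintf(old_path, sizeof(old_path), "%s/%s",
			 state_dir, de->d_name);
		snprintf(new_path, sizeof(new_path), "%s/%02u/%s",
			 state_dir, job_id % 100, de->d_name);
		if (rename(old_path, new_path))
			fprintf(stderr, "rename(%s): %s\n",
				old_path, strerror(errno));
	}
	closedir(dp);
	return 0;
}
```

Because rename() within one filesystem is atomic, a crash mid-pass leaves every job directory in exactly one of the two valid locations.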

Cheers,
Clay

On Thu, May 31, 2012 at 8:35 AM, Moe Jette <[email protected]> wrote:

>
> Quoting Clay Teeter <[email protected]>:
>
> > Sure, I'll volunteer.   Comments inline
> >
> > On Wed, May 30, 2012 at 3:03 PM, Moe Jette <[email protected]> wrote:
> >
> >>
> >> If you are volunteering, then sure.
> >> * Basing the subdirectory off the last digit or two of the job id
> >> should be easiest
> >>
> >
> > For clarity, this is what is requested.
> > /45/xxxx45
> > /45/yyyyy45
> > /46/xxxx46
> > /46/yyyyy46
>
> Perfect. One digit would get you up to about 300k jobs, which SLURM
> would struggle to handle today (although that is changing). Two digits
> as shown above should be good for any workload that SLURM is likely to
> ever see.
>
>
> >> * Code needs to be added to create these new directories either on
> >> demand or at slurmctld startup
> >> * I would suggest making the new logic conditional upon a SLURM
> >> build-time option
> >>
> >
> > Why wouldn't this be a slurm.conf option?  Seems easier and more
> > flexible than a build option.
>
> My concern was that someone might change the configuration back and forth.
> If the directory locations are reconciled at startup (both to and from
> the extra subdirectories), then making this a configuration option is
> good.
>
>
> >> * Existing directories need to be moved or their jobs will be killed
> >> when slurmctld restarts using the new logic
> >>
> >
> > On restart, job directories would be reconciled.
> >
> >
> >> Quoting Clay Teeter <[email protected]>:
> >>
> >> > It sounds like the second option (partition state on jobid or ...)
> >> > would be a great general solution.  Would people here be interested
> >> > in a patch for this?
> >> >
> >> > Cheers
> >> > Clay
> >> >
> >> > On Wed, May 30, 2012 at 1:03 PM, Moe Jette <[email protected]> wrote:
> >> >
> >> >>
> >> >> Oddly enough, I ran across this problem just yesterday on an old
> >> >> CentOS distro.
> >> >> No great solutions, but here are some options:
> >> >> * Upgrade the OS
> >> >> * Modify SLURM to spread out the job directories into
> >> >> subdirectories, say using a subdirectory based upon the last digit
> >> >> of the job ID.  This applies to code in only a couple of places, so
> >> >> it should be pretty simple (search for "/environment" in
> >> >> src/slurmctld/job_mgr.c)
> >> >> * Configure MaxJobCount=32000 in slurm.conf and force users to
> >> >> reduce the load
> >> >> * The directories are created only for batch jobs, so if you can run
> >> >> interactive jobs (srun/salloc) this limit would not apply
> >> >>
> >> >>
> >> >> Quoting Clay Teeter <[email protected]>:
> >> >>
> >> >> > Thanks for the quick response!  Given that our system is ext3
> >> >> > using a 2.6 kernel, is there anything that we can do to configure
> >> >> > slurm not to create 32K directories/jobs in /var/slurm/state/?
> >> >> >
> >> >> > Cheers,
> >> >> > Clay
> >> >> >
> >> >> > On Wed, May 30, 2012 at 10:56 AM, Moe Jette <[email protected]> wrote:
> >> >> >
> >> >> >>
> >> >> >> See:
> >> >> >> http://superuser.com/questions/298420/cannot-mkdir-too-many-links
> >> >> >>
> >> >> >> With Ubuntu 12.04 (Linux 3.2.0-24) the limit is at least 200k
> >> >> >> rather than 32k.
> >> >> >>
> >> >> >> Quoting Clay Teeter <[email protected]>:
> >> >> >>
> >> >> >> > Hi Group,
> >> >> >> >
> >> >> >> > Anyone know how I might troubleshoot this error message?
> >> >> >> >
> >> >> >> > [2012-05-15T19:34:27] _slurm_rpc_submit_batch_job: I/O error
> >> >> >> > writing script/environment to file
> >> >> >> > [2012-05-15T19:34:28] error: mkdir(/var/slurm/state/job.3258740)
> >> >> >> > error Too many links
> >> >> >> >
> >> >> >> > Cheers,
> >> >> >> > Clay
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >
> >>
> >>
> >
>
>

Attachment: enable_state_save_partition.diff
