Sure, I'll volunteer. Comments inline.

On Wed, May 30, 2012 at 3:03 PM, Moe Jette <[email protected]> wrote:
>
> If you are volunteering, then sure.
> * Basing the subdirectory off the last digit or two of the job id
> should be easiest
>

For clarity, this is what is requested.

/45/xxxx45
/45/yyyyy45
/46/xxxx46
/46/yyyyy46

> * Code needs to be added to create these new directories either on
> demand or at slurmctld startup
> * I would suggest making the new logic conditional upon a SLURM
> build-time option
>

Why wouldn't this be a slurm.conf option? Seems easier and more flexible than a build option.

> * Existing directories need to be moved or their jobs will be killed
> when slurmctld restarts using the new logic
>

On restart, job directories would be reconciled.

> Quoting Clay Teeter <[email protected]>:
>
> > It sounds like the second option (partition state on jobid or ...) would be
> > a great general solution. Would people here be interested in a patch for
> > this?
> >
> > Cheers
> > Clay
> >
> > On Wed, May 30, 2012 at 1:03 PM, Moe Jette <[email protected]> wrote:
> >
> >>
> >> Oddly enough, I ran across this problem just yesterday on an old
> >> CentOS distro.
> >> No great solutions, but here are some options:
> >> * Upgrade the OS
> >> * Modify SLURM to spread out the job directories into subdirectories,
> >> say using a subdirectory based upon the last digit of the job ID. This
> >> applies to code in only a couple of places, so it should be pretty
> >> simple (search for "/environment" in src/slurmctld/job_mgr.c)
> >> * Configure MaxJobs=32000 in slurm.conf and force users to reduce the load
> >> * The directories are created only for batch jobs, so if you can run
> >> interactive jobs (srun/salloc) this limit would not apply
> >>
> >>
> >> Quoting Clay Teeter <[email protected]>:
> >>
> >> > Thanks for the quick response! Given that our system is ext3 using a 2.6
> >> > kernel, is there anything that we can do to configure slurm not to create
> >> > 32K directories/jobs in /var/slurm/state/?
> >> >
> >> > Cheers,
> >> > Clay
> >> >
> >> > On Wed, May 30, 2012 at 10:56 AM, Moe Jette <[email protected]> wrote:
> >> >
> >> >>
> >> >> See:
> >> >> http://superuser.com/questions/298420/cannot-mkdir-too-many-links
> >> >>
> >> >> With Ubuntu 12.04 (Linux 3.2.0-24) the limit is at least 200k rather than
> >> >> 32k.
> >> >>
> >> >> Quoting Clay Teeter <[email protected]>:
> >> >>
> >> >> > Hi Group,
> >> >> >
> >> >> > Anyone know how I might troubleshoot this error message?
> >> >> >
> >> >> > [2012-05-15T19:34:27] _slurm_rpc_submit_batch_job: I/O error writing
> >> >> > script/environment to file
> >> >> > [2012-05-15T19:34:28] error: mkdir(/var/slurm/state/job.3258740) error Too many links
> >> >> >
> >> >> > Cheers,
> >> >> > Clay
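
For reference, a rough sketch of the layout discussed in this thread: job directories spread across subdirectories named after the last two digits of the job ID, with the bucket directory created on demand. This is only an illustration under my own assumptions, not SLURM code. The function names (job_bucket_path, job_dir_path, mkdir_if_missing, create_job_dir), the JOB_DIR_BUCKETS constant, and the 0700 mode are all hypothetical; the "job.<id>" naming is taken from the error message quoted above.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical constant, not part of SLURM: 100 buckets means the
 * bucket name is the last two decimal digits of the job ID. */
#define JOB_DIR_BUCKETS 100

/* Write "<state_dir>/<NN>" into buf, where NN is job_id modulo 100,
 * zero-padded. Returns 0 on success, -1 if buf is too small. */
int job_bucket_path(char *buf, size_t len, const char *state_dir,
                    uint32_t job_id)
{
    int n = snprintf(buf, len, "%s/%02u", state_dir,
                     (unsigned)(job_id % JOB_DIR_BUCKETS));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "<state_dir>/<NN>/job.<job_id>" into buf.
 * Returns 0 on success, -1 if buf is too small. */
int job_dir_path(char *buf, size_t len, const char *state_dir,
                 uint32_t job_id)
{
    int n = snprintf(buf, len, "%s/%02u/job.%u", state_dir,
                     (unsigned)(job_id % JOB_DIR_BUCKETS),
                     (unsigned)job_id);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* mkdir() wrapper that treats EEXIST as success, so the bucket
 * directory can be created on demand by whichever job needs it first. */
int mkdir_if_missing(const char *path)
{
    if (mkdir(path, 0700) == 0 || errno == EEXIST)
        return 0;
    return -1;
}

/* Create the bucket directory if needed, then the job directory.
 * Assumes state_dir itself already exists. */
int create_job_dir(const char *state_dir, uint32_t job_id)
{
    char path[4096];

    if (job_bucket_path(path, sizeof(path), state_dir, job_id) != 0 ||
        mkdir_if_missing(path) != 0)
        return -1;
    if (job_dir_path(path, sizeof(path), state_dir, job_id) != 0)
        return -1;
    return mkdir_if_missing(path);
}
```

With this sketch, job 3258740 under /var/slurm/state would land in /var/slurm/state/40/job.3258740. Each bucket then holds roughly one hundredth of the job directories, which should keep a single directory well under the ext3 limit of about 32K subdirectories for the job counts mentioned here. Moving existing job.* directories into their buckets at slurmctld restart is not covered.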
