If you are volunteering, then sure.
* Basing the subdirectory off the last digit or two of the job ID
  should be easiest
* Code needs to be added to create these new directories either on
  demand or at slurmctld startup
* I would suggest making the new logic conditional upon a SLURM
  build-time option
* Existing directories need to be moved or their jobs will be killed
  when slurmctld restarts using the new logic

Quoting Clay Teeter <[email protected]>:

> It sounds like the second option (partition state on jobid or ...) would be
> a great general solution.  Would people here be interested in a patch for
> this?
>
> Cheers
> Clay
>
> On Wed, May 30, 2012 at 1:03 PM, Moe Jette <[email protected]> wrote:
>
>>
>> Oddly enough, I ran across this problem just yesterday on an old
>> CentOS distro.
>> No great solutions, but here are some options:
>> * Upgrade the OS
>> * Modify SLURM to spread out the job directories into subdirectories,
>> say using a subdirectory based upon the last digit of the job ID. This
>> applies to code in only a couple of places, so it should be pretty
>> simple (search for "/environment" in src/slurmctld/job_mgr.c)
>> * Configure MaxJobs=32000 in slurm.conf and force users to reduce the load
>> * The directories are created only for batch jobs, so if you can run
>> interactive jobs (srun/salloc) this limit would not apply
>>
>>
>> Quoting Clay Teeter <[email protected]>:
>>
>> > Thanks for the quick response!  Given that our system is ext3 using a 2.6
>> > kernel, is there anything that we can do to configure slurm not to create
>> > 32K directories/jobs in /var/slurm/state/?
>> >
>> > Cheers,
>> > Clay
>> >
>> > On Wed, May 30, 2012 at 10:56 AM, Moe Jette <[email protected]> wrote:
>> >
>> >>
>> >> See:
>> >> http://superuser.com/questions/298420/cannot-mkdir-too-many-links
>> >>
>> >> With Ubuntu 12.04 (Linux 3.2.0-24) the limit is at least 200k rather than
>> >> 32k.
>> >>
>> >> Quoting Clay Teeter <[email protected]>:
>> >>
>> >> > Hi Group,
>> >> >
>> >> > Anyone know how I might troubleshoot this error message?
>> >> >
>> >> > [2012-05-15T19:34:27] _slurm_rpc_submit_batch_job: I/O error writing
>> >> > script/environment to file
>> >> > [2012-05-15T19:34:28] error: mkdir(/var/slurm/state/job.3258740) error
>> >> > Too many links
>> >> >
>> >> > Cheers,
>> >> > Clay
>> >> >
>> >>
>> >>
>> >
>>
>>
>
