Quoting Clay Teeter <[email protected]>:

> Sure, I'll volunteer.   Comments inline
>
> On Wed, May 30, 2012 at 3:03 PM, Moe Jette <[email protected]> wrote:
>
>>
>> If you are volunteering, then sure.
>> * Basing the subdirectory off the last digit or two of the job id
>> should be easiest
>>
>
> For clarity, this is what is requested.
> /45/xxxx45
> /45/yyyyy45
> /46/xxxx46
> /46/yyyyy46

Perfect. One digit would get you up to about 300k jobs, which SLURM  
would struggle to handle today (although that is changing). Two digits  
as shown above should be good for any workload that SLURM is likely to  
ever see.


>> * Code needs to be added to create these new directories either on
>> demand or at slurmctld startup
>> * I would suggest making the new logic conditional upon a SLURM
>> build-time option
>>
>
> Why wouldn't this be a slurm.conf option?  Seems easier and more flexible
> than a build option.

My concern was that someone might change the configuration back and
forth. If the directory locations are reconciled at startup (both to
and from the extra subdirectories), then making this a configuration
option is fine.


>> * Existing directories need to be moved or their jobs will be killed
>> when slurmctld restarts using the new logic
>>
>
> On restart, job directories would be reconciled.
>
>
>> Quoting Clay Teeter <[email protected]>:
>>
>> > It sounds like the second option (partition state on jobid or ...) would
>> be
>> > a great general solution.  Would people here be interested in a patch for
>> > this?
>> >
>> > Cheers
>> > Clay
>> >
>> > On Wed, May 30, 2012 at 1:03 PM, Moe Jette <[email protected]> wrote:
>> >
>> >>
>> >> Oddly enough, I ran across this problem just yesterday on an old
>> >> CentOS distro.
>> >> No great solutions, but here are some options:
>> >> * Upgrade the OS
>> >> * Modify SLURM to spread out the job directories into subdirectories,
>> >> say using a subdirectory based upon the last digit of the job ID. This
>> >> applies to code in only a couple of places, so it should be pretty
>> >> simple (search for "/environment" in src/slurmctld/job_mgr.c)
>> >> * Configure MaxJobs=32000 in slurm.conf and force users to reduce the load
>> >> * The directories are created only for batch jobs, so if you can run
>> >> interactive jobs (srun/salloc) this limit would not apply
>> >>
>> >>
>> >> Quoting Clay Teeter <[email protected]>:
>> >>
>> >> > Thanks for the quick response!  Given that our system is ext3 using a
>> 2.6
>> >> > kernel, is there anything that we can do to configure slurm not to
>> create
>> >> > 32K directories/jobs in /var/slurm/state/?
>> >> >
>> >> > Cheers,
>> >> > Clay
>> >> >
>> >> > On Wed, May 30, 2012 at 10:56 AM, Moe Jette <[email protected]>
>> wrote:
>> >> >
>> >> >>
>> >> >> See:
>> >> >> http://superuser.com/questions/298420/cannot-mkdir-too-many-links
>> >> >>
>> >> >> With Ubuntu 12.04 (Linux 3.2.0-24) the limit is at least 200k rather
>> than
>> >> >> 32k.
>> >> >>
>> >> >> Quoting Clay Teeter <[email protected]>:
>> >> >>
>> >> >> > Hi Group,
>> >> >> >
>> >> >> > Anyone know how I might troubleshoot this error message?
>> >> >> >
>> >> >> > [2012-05-15T19:34:27] _slurm_rpc_submit_batch_job: I/O error
>> writing
>> >> >> > script/environment to file
>> >> >> > [2012-05-15T19:34:28] error: mkdir(/var/slurm/state/job.3258740)
>> error
>> >> >> Too
>> >> >> > many links
>> >> >> >
>> >> >> > Cheers,
>> >> >> > Clay
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >
>>
>>
>
