I'm looking for suggestions for dealing with h_vmem requirements for
multi-slot jobs.

We use memory as a consumable and a required complex.

I understand that SGE multiplies the h_vmem request by the number of slots
in order to determine the job memory requirement.

In our environment, there are processing pipelines that take parameters
to control the number of child processes launched by the job.

For these jobs, the high-point in memory use is independent of the number
of child processes.

For example, a job will begin with a single-threaded section that uses
2GB of RAM, then launch "N" child processes that use 500MB each, then
finish with a section that assembles the results of the child processes
and requires 8GB.
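
To make that memory profile concrete, here is a minimal Python sketch.
It assumes the three phases run one after another without overlapping
and that 1GB = 1024MB. The function name and constants are illustrative
only and are not part of our pipeline or of SGE.

```python
# Illustrative model of the example job's memory profile (values in MB).
# Assumption: the three phases run sequentially and do not overlap.
SETUP_MB = 2 * 1024      # single-threaded first section, 2GB
CHILD_MB = 500           # each child process, 500MB
ASSEMBLY_MB = 8 * 1024   # final assembly section, 8GB


def peak_memory_mb(n_children: int) -> int:
    """Return the highest memory use across the three phases."""
    return max(SETUP_MB, n_children * CHILD_MB, ASSEMBLY_MB)


if __name__ == "__main__":
    for n in (0, 3, 6):
        print(n, peak_memory_mb(n))
```

Under these assumptions the peak stays at the 8GB assembly step for
every child count our users typically choose.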

That example job would be submitted with the option "-l h_vmem=8G".

Users are aware that they must give a "-pe threaded N" parameter to SGE
when they run the job with "N" child processes.

Typically, users will run this type of job with either zero or 3 to 6
child processes.

I've written a JSV to divide the user-supplied h_vmem value by the number of
slots, and reset h_vmem. This allows users to avoid recalculating the memory
requirement whenever they submit a job with more than 1 child process.
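
The adjustment is roughly the following arithmetic. This is a Python
sketch of the calculation only, not the actual JSV. The helper names
are hypothetical, and it assumes the uppercase suffixes K, M and G are
binary multiples and that the result is rounded up to whole megabytes.

```python
# Sketch of the per-slot division described above; NOT the real JSV.
# Assumption: uppercase suffixes are binary multiples (K=1024, M=1024**2, ...).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_mem(value: str) -> int:
    """Convert a string such as '8G' or '500M' to a number of bytes."""
    suffix = value[-1]
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(value)  # no suffix: plain number of bytes


def per_slot_h_vmem(requested: str, slots: int) -> str:
    """Divide the user's total request by the slot count, rounding up to whole MB."""
    total = parse_mem(requested)
    per_slot = -(-total // max(slots, 1))   # ceiling division, in bytes
    mb = -(-per_slot // UNITS["M"])         # round up to whole megabytes
    return f"{mb}M"


if __name__ == "__main__":
    print(per_slot_h_vmem("8G", 1))
    print(per_slot_h_vmem("8G", 4))
```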

This causes a problem if the JSV-calculated h_vmem value is lower than
the actual memory use and SGE kills the job. For example, if the job
described above is submitted with:

        -pe threaded 6 -l h_vmem=8G

the JSV readjusts h_vmem to 1.5G, and the job is killed when it tries
to use 8GB.

Without the JSV, when users submit these jobs with a parameter to launch
multiple child processes (with the corresponding "-pe threaded" option),
SGE will set a higher-than-needed memory requirement (48GB in the above
example).  This means that the job cannot be scheduled if it appears
to exceed the memory of our largest server.  If the job can be run,
it usually will wait a long time for a machine with sufficient memory
to be available, and then it blocks other users from running jobs on the
same node because SGE treats memory as a consumable.
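
The over-reservation can be worked through with the numbers from the
example. This Python sketch assumes, as described above, that SGE
debits h_vmem times the slot count from the host's consumable. The 32GB
host size is a made-up figure for illustration, and the function names
are hypothetical.

```python
# Memory reserved for a multi-slot job when h_vmem is a per-slot
# consumable, using the figures from the example in this post.

def reserved_gb(h_vmem_gb: float, slots: int) -> float:
    """Total reservation = per-slot request multiplied by the slot count."""
    return h_vmem_gb * slots


def can_schedule(h_vmem_gb: float, slots: int, host_free_gb: float) -> bool:
    """True if a host with host_free_gb of free consumable memory fits the job."""
    return reserved_gb(h_vmem_gb, slots) <= host_free_gb


if __name__ == "__main__":
    print(reserved_gb(8, 6))        # 8G per slot, 6 slots
    print(can_schedule(8, 6, 32))   # hypothetical host with 32GB free
    print(can_schedule(8, 1, 32))   # same request as a single-slot job
```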

Is there a way to tell SGE not to multiply the user-supplied h_vmem
request by the number of requested slots?

Is there another parameter that could be changed within the JSV to preserve
the user-supplied h_vmem and prevent SGE from trying to require excessive
memory?

Do you have any suggestions in terms of user training and education to
explain this situation so that they can submit single- or multi-slot
jobs with appropriate memory requests?

Thanks,

Mark