Hello,
I have been using h_vmem as a consumable resource to limit the amount of
memory users can request and to make sure jobs don't use more than they
requested. It all has been working fine until we added nodes with GPU
modules.
The memory model in CUDA applications is such that the virtual address
space is expanded to provide a single address space covering both
regular host memory and GPU memory (and apparently something else,
because I have seen reported memory usage at twice the virtual memory
size).
This results in jobs getting killed for exceeding h_vmem, whereas in
fact the actual memory usage was really low. So this effectively
renders h_vmem useless, because it either kills jobs that should not be
killed or must be set very high (2-3 times the actual virtual memory
size) on the host and in job requests to allow jobs to run.
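To illustrate the gap described above: h_vmem is enforced against a
process's virtual size (VmSize), while its real RAM footprint is the
resident set (VmRSS). A minimal sketch, assuming a Linux /proc
filesystem (the helper name memory_kb is mine, not an SGE facility);
run inside a CUDA job, VmSize balloons far beyond VmRSS once a CUDA
context is created:

```python
def memory_kb(pid="self"):
    """Return (VmSize, VmRSS) in kB from /proc/<pid>/status.

    VmSize is the virtual address space h_vmem is checked against;
    VmRSS is the memory actually resident in RAM.
    """
    sizes = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":")
                sizes[key] = int(value.split()[0])  # value is "<n> kB"
    return sizes["VmSize"], sizes["VmRSS"]

if __name__ == "__main__":
    vmsize, vmrss = memory_kb()
    print(f"VmSize {vmsize} kB vs VmRSS {vmrss} kB")
```

For an ordinary process the two numbers are of the same order; for a
CUDA process the first can exceed the second many times over, which is
exactly what trips the h_vmem limit.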
I was wondering if there were good solutions or practices on controlling
memory usage on GPU-equipped nodes. I am using SGE 6.2u5.
Thank you,
Ilya.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users