Hi,

On 29.11.2013 at 12:41, Txema Heredia wrote:

> Hi all,
> 
> We are having some problems with jobs using a C++ binary program that, simply 
> put, ignores all slot allocations.
> 
> The C code in question uses a call to "sysconf(_SC_NPROCESSORS_ONLN);" to 
> determine the number of threads it can open, and pthreads to parallelize.
> The problem is that this retrieves all the online cores, not just the 
> ones assigned by either SGE or core binding. So we end up with 12 jobs on a 
> node, each with 1 slot assigned by SGE and the whole job bound to that 
> core, but each job running 12 threads that fight for CPU cycles inside 
> that single core. Then the load average skyrockets and the node is no longer 
> usable until the CPU storm has passed.
> 
> I have been investigating a little and I haven't found any "out-of-the-box" 
> method to have C report the "granted" number of cores. All the direct methods 
> (single-function call) I have tested report the total number of cores in the 
> system:
> sysconf(_SC_NPROCESSORS_ONLN);
> sysconf(_SC_NPROCESSORS_CONF);
> get_nprocs_conf ();
> get_nprocs ();

Why not just use the $NSLOTS environment variable via getenv("NSLOTS")? It 
would be best to test whether the result is NULL, which means the program is 
running outside of SGE, and to fall back to the original functions in that case.
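
A minimal sketch of that idea, assuming the program only needs a thread count 
at startup. The helper name granted_slots() is made up for illustration and is 
not part of SGE or libc; treating an empty or non-numeric NSLOTS the same as an 
unset one is also an assumption.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: number of worker threads to start.
 * Uses NSLOTS when it holds a valid positive integer, otherwise falls
 * back to the online core count the program used before. */
long granted_slots(void)
{
    const char *s = getenv("NSLOTS");
    if (s != NULL && *s != '\0') {
        char *end = NULL;
        errno = 0;
        long n = strtol(s, &end, 10);
        if (errno == 0 && *end == '\0' && n > 0)
            return n;
    }
    /* Not running under SGE (or NSLOTS is unusable): original behaviour. */
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    return online > 0 ? online : 1;
}
```

The existing sysconf(_SC_NPROCESSORS_ONLN) call in the program would then be 
replaced by granted_slots().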

If your applications are dynamically linked, you could use LD_PRELOAD to load 
a prepared library that answers the "sysconf(_SC_NPROCESSORS_ONLN);" calls 
with the value of the environment variable and forwards all other cases to the 
default libc.
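
A rough sketch of such a preload library, assuming Linux with glibc. The file 
and library names used below (nslots_sysconf.c, libnslots.so) are only 
placeholders, and the NSLOTS parsing rules are an assumption. It only covers 
sysconf(); programs that call get_nprocs() directly, or that are statically 
linked, are not affected by it.

```c
#define _GNU_SOURCE   /* for RTLD_NEXT */
#include <dlfcn.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Interposed sysconf(): answers _SC_NPROCESSORS_ONLN from NSLOTS when that
 * holds a valid positive integer, and forwards everything else to libc. */
long sysconf(int name)
{
    static long (*real_sysconf)(int) = NULL;
    if (real_sysconf == NULL)
        real_sysconf = (long (*)(int))dlsym(RTLD_NEXT, "sysconf");

    if (name == _SC_NPROCESSORS_ONLN) {
        const char *s = getenv("NSLOTS");
        if (s != NULL && *s != '\0') {
            char *end = NULL;
            errno = 0;
            long n = strtol(s, &end, 10);
            if (errno == 0 && *end == '\0' && n > 0)
                return n;
        }
    }
    if (real_sysconf == NULL) {   /* libc symbol not found: report an error */
        errno = EINVAL;
        return -1;
    }
    return real_sysconf(name);
}
```

It could be built with something like 
"gcc -shared -fPIC -o libnslots.so nslots_sysconf.c -ldl" and activated by 
setting LD_PRELOAD to the full path of libnslots.so in the job script before 
the binary is started.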

-- Reuti


> The only method I have found ( 
> http://stackoverflow.com/questions/4586405/get-number-of-cpus-in-linux-using-c
>  ) to report the proper number of assigned cores, requires creating a 
> function that loops and checks the job affinity for all the cores.
> This method (apparently) works. At least it reports the number of cores the 
> job is bound to.
> 
> For reference, this is the code I tested:
> 
> #define _GNU_SOURCE   /* needed in C for cpu_set_t and the CPU_* macros */
> #include <sched.h>
> #include <pthread.h>
> #include <unistd.h>
> #include <sys/sysinfo.h>
> #include <stdio.h>
> 
> 
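> /* Count the CPUs this process is allowed to run on (its affinity mask). */
> int GetCPUCount()
> {
>        cpu_set_t cs;
>        CPU_ZERO(&cs);
>        if (sched_getaffinity(0, sizeof(cs), &cs) != 0)
>                return -1;      /* affinity mask could not be read */
> 
>        /* CPU_COUNT also handles masks with gaps in the CPU numbering. */
>        return CPU_COUNT(&cs);
> }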
> 
> 
> int main(int argc, char* argv[]){
>        long sc = sysconf(_SC_NPROCESSORS_ONLN);
>        long sc_conf = sysconf(_SC_NPROCESSORS_CONF);
>        long nprocs_conf = get_nprocs_conf ();
>        long nprocs = get_nprocs ();
>        long sched = GetCPUCount();
> 
>        /* All values are long, so print them with %ld. */
>        printf("sysconf(_SC_NPROCESSORS_ONLN) = %ld\n",sc);
>        printf("sysconf(_SC_NPROCESSORS_CONF) = %ld\n",sc_conf);
>        printf("get_nprocs_conf() = %ld\n",nprocs_conf);
>        printf("get_nprocs() =  %ld\n",nprocs);
>        printf("sched_getaffinity = %ld\n",sched);
>        return 0;
> }
> 
> 
> After submitting it in a job, these are the results:
> 
> #1-slot, core binding=1
> qsub -cwd -l h_vmem=500M -binding linear:1 -b y ./test_n_procs
> 
> sysconf(_SC_NPROCESSORS_ONLN) = 12
> sysconf(_SC_NPROCESSORS_CONF) = 12
> get_nprocs_conf() = 12
> get_nprocs() =  12
> sched_getaffinity = 1
> 
> #3-slots, core binding=1
> qsub -cwd -l h_vmem=500M -pe threaded 3 -binding linear:1 -b y ./test_n_procs
> 
> sysconf(_SC_NPROCESSORS_ONLN) = 12
> sysconf(_SC_NPROCESSORS_CONF) = 12
> get_nprocs_conf() = 12
> get_nprocs() =  12
> sched_getaffinity = 1
> 
> #3-slots, core binding=3
> qsub -cwd -l h_vmem=500M -pe threaded 3 -binding linear:3 -b y ./test_n_procs
> 
> sysconf(_SC_NPROCESSORS_ONLN) = 12
> sysconf(_SC_NPROCESSORS_CONF) = 12
> get_nprocs_conf() = 12
> get_nprocs() =  12
> sched_getaffinity = 3
> 
> #3-to-6-slots, core binding=6
> qsub -cwd -l h_vmem=500M -pe threaded 3-6 -binding linear:6 -b y ./test_n_procs
> 
> sysconf(_SC_NPROCESSORS_ONLN) = 12
> sysconf(_SC_NPROCESSORS_CONF) = 12
> get_nprocs_conf() = 12
> get_nprocs() =  12
> sched_getaffinity = 6
> 
> 
> 
> Has anyone encountered this problem before? Is there a more elegant solution? 
> Is there a way that doesn't require reprogramming all the software that faces 
> this problem?
> 
> Thanks in advance,
> 
> Txema
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

