> On 17.10.2016 at 16:36, Lucas, Douglas S. <lucas.sco...@mayo.edu> wrote:
> 
> We're running OGE 2011.11 on an SGI UV 2000 (SMP) with 256 hyperthreaded cores 
> (128 physical). When we run an OpenMP job on the system, it runs fine. Here's 
> the job:
> 
> #include <iostream>
> #include <cstring>
> #include <cstdlib>
> #include <math.h>
> #include <new>      // for std::nothrow, used below
> #include <omp.h>
>  
> using namespace std;
>  
> int main (
>         int argc,
>         char* argv[] ) {
>  
>  
> #if _OPENMP
>     // Show how many threads we have available
>     int max_t = omp_get_max_threads();
>     cout << "OpenMP using up to " << max_t << " threads" << endl;
> #else
>     cout << "!!!ERROR!!! Program not compiled for OpenMP" << endl;
>     return -1;
> #endif
>  
>     const long N = 115166;
>     const long bytesRequested = N * N * sizeof(double);
>  
>     cout << "Allocating " << bytesRequested << " bytes for matrix" <<     
> endl;
>  
>     // Plain new would throw std::bad_alloc on failure, so the NULL check
>     // below would never be reached; use the nothrow form instead.
>     double* S = new (std::nothrow) double[ N * N ];
>  
>     if( NULL == S ) {
>         cout << "!!!ERROR!!! Failed to allocate " << bytesRequested << " bytes" << endl;
>         return -1;
>     }
>  
>     cout << "Entering main loop" << endl;
>  
> #pragma omp parallel for schedule(static)
>     for ( long i = 0; i < N - 1; i++ ) {
>         for ( long j = i + 1; j < N; j++ ) {
> #if _OPENMP
>             int tid=omp_get_thread_num();
>             if( 0 == i && 1 == j ) {
>                 int nThreads=omp_get_num_threads();
>                 cout << "OpenMP loop using " << nThreads << " threads" <<     
> endl;
>             }
> #endif
>  
>             S[ i * N + j ] = sqrt( i + j );
>         }
>     }
>  
>     cout << "Loop completed" << endl;
>     delete[] S;   // array form to match new[]
>     return 0;
> }
>  
> 
>  
> 
> And here it is being executed:
> 
> [c++]$ ./OMPtest
> OpenMP using up to 256 threads
> Allocating 106105660448 bytes for matrix
> Entering main loop
> OpenMP loop using 256 threads
> Loop completed
> 
> However, when I submit it to the queue using the following (and so far any) 
> parallel environment, the load on the CPU shoots through the roof (well over 
> 256), and the system becomes completely unresponsive and has to be power 
> cycled. Here's my parallel environment:
> 
> [c++]$ qconf -sp threaded
> pe_name             threaded
> slots               10000
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /bin/true
> stop_proc_args      /bin/true
> allocation_rule     $pe_slots
> control_slaves      FALSE
> job_is_first_task   TRUE
> urgency_slots       min
> accounting_summary  TRUE
> 
> I've changed control_slaves, job_is_first_task, and slots (reduced to under 140; 
> anything over 140 and I get the runaway load condition described above). I've 
> even used different parallel environments that I've created. I've also reduced 
> the slot count in the queue to 140, yet the load still runs away to over 256 
> and locks the machine (requiring a hard reboot). Lastly, I've tried numerous 
> iterations of my qsub script; here's my current version of it:

In current Linux kernels, processes in the uninterruptible kernel task state 
(D state) also count toward the load.
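
A quick, generic way to check whether such tasks are inflating the load while 
the job runs (not SGE-specific) is to list the processes currently in D state:

    ps -eo state,pid,comm | awk '$1 == "D"'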

How much memory does your machine have? With h_vmem=4G the limit will be 
multiplied by the number of slots. Could you be facing heavy swapping? As you 
are using threads, it might be sufficient to divide the value beforehand by 
the number of requested slots.
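
For example, a small submission wrapper could do that division before calling 
qsub (just a sketch; the 139 slots, the 512 GB total, and the job script name 
job.sh are placeholders for your actual values):

    #!/bin/sh
    # divide the desired total memory by the slot count, since SGE
    # multiplies the per-slot h_vmem request by the number of slots
    SLOTS=139
    TOTAL_GB=512
    PER_SLOT_GB=$(( TOTAL_GB / SLOTS ))
    qsub -pe threaded $SLOTS -l h_vmem=${PER_SLOT_GB}G job.sh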

You could also compare the ulimits between an interactive start and a run 
under SGE.
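
For instance (a minimal sketch), print the limits from inside the job script 
and compare them with an interactive shell:

    # add to the job script, just before launching <path>/OMPtest
    echo "limits inside the SGE job:"
    ulimit -a

    # then run `ulimit -a` in an interactive shell and compare the two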

-- Reuti


> #!/bin/sh
> #$ -cwd
> #$ -q sgi-test
> ## email on a - abort, b - begin, e - end
> #$ -m abe
> #$ -M <email address>
> #source ~/.bash_profile
> ## for this job, specifying the threaded environment with a "-" ensures
> ## the max number of processors is used
> #$ -pe threaded -
> echo "slots = $NSLOTS"
> export OMP_NUM_THREADS=$NSLOTS
> echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
> echo "Running on host=$HOSTNAME"
> ## memory resource request per thread, max 24 for 32 threads
> #$ -l h_vmem=4G
> ##$ -V
> ## this environment variable setting is needed only for
> ## OpenMP-parallelized applications
> ## finally! -- run your process
> <path>/OMPtest
>  
> 
> Since unlimited processors/slots have always crashed the machine, I've 
> specified:
> 
>     #$ -pe threaded 139
> Anything above 139 crashes the machine, yet there's no output in mcelog or 
> /var/log/messages. Any insight into what could be happening would be greatly 
> appreciated!
> 
>  
>  
> Scott Lucas
> HPC Applications Support
> 208-776-0209
> lucas.sco...@mayo.edu
>  
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

