We're running OGE 2011.11 on an SGI UV 2000 (SMP) with 256 hyperthreaded cores
(128 physical). When we run an OpenMP job directly on the system, it runs fine.
Here's the job:
#include <iostream>
#include <cstring>
#include <cstdlib>
#include <math.h>
#include <new>      // std::nothrow, for the allocation check below
#include <omp.h>

using namespace std;

int main (
        int argc,
        char* argv[] ) {


#if _OPENMP
    // Show how many threads we have available
    int max_t = omp_get_max_threads();
    cout << "OpenMP using up to " << max_t << " threads" << endl;
#else
    cout << "!!!ERROR!!! Program not compiled for OpenMP" << endl;
    return -1;
#endif

    const long N = 115166;
    const long bytesRequested = N * N * sizeof(double);

    cout << "Allocating " << bytesRequested << " bytes for matrix" <<     endl;

    // Use nothrow new so the NULL check below is meaningful; a plain new
    // would throw std::bad_alloc on failure instead of returning NULL.
    double* S = new( std::nothrow ) double[ N * N ];

    if( NULL == S ) {
        cout << "!!!ERROR!!! Failed to allocate " << bytesRequested << " bytes" << endl;
        return -1;
    }

    cout << "Entering main loop" << endl;

#pragma omp parallel for schedule(static)
    for ( long i = 0; i < N - 1; i++ ) {
        for ( long j = i + 1; j < N; j++ ) {
#if _OPENMP
            int tid = omp_get_thread_num();
            if( 0 == i && 1 == j ) {
                int nThreads = omp_get_num_threads();
                cout << "OpenMP loop using " << nThreads << " threads" << endl;
            }
#endif

            S[ i * N + j ] = sqrt( i + j );
        }
    }

    cout << "Loop completed" << endl;
    delete[] S;
    return 0;
}


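For reference, the test program is built along these lines (I'm assuming g++
here; any compiler with OpenMP support works the same way, the only essential
flag is the OpenMP switch):
[c++]$ g++ -O2 -fopenmp OMPtest.cpp -o OMPtest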
And here it is being executed:
[c++]$ ./OMPtest
OpenMP using up to 256 threads
Allocating 106105660448 bytes for matrix
Entering main loop OpenMP loop using 256 threads
Loop completed
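For scale, that allocation is 115166 x 115166 x 8 bytes = 106,105,660,448
bytes, roughly 99 GiB, and the machine handles it without trouble when the
program is run interactively.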
However, when I submit it through the queue using the following parallel
environment (and so far any other PE I try), the CPU load shoots through the
roof (well over 256) and the system becomes completely unresponsive and has to
be power cycled. Here's my PE configuration:
[c++]$ qconf -sp threaded
pe_name threaded
slots 10000
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary TRUE
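For completeness, a quick way to confirm the PE is attached to the sgi-test
queue (the queue named in my submit script) and to see the queue's own slot
limit (grep pattern from memory):
[c++]$ qconf -sq sgi-test | egrep 'pe_list|slots'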
I've changed control_slaves, job_is_first_task, and slots (reduced to under
140; anything over 140 and I get the runaway load condition described above).
I've even used different parallel environments that I've created. I've also
reduced the slot count in the queue to 140, yet the load still runs away to
over 256 and locks the machine (requiring a hard reboot). Lastly, I've tried
numerous iterations of my qsub script; here's my current version:
#!/bin/sh
#$ -cwd
#$ -q sgi-test
## email on a - abort, b - begin, e - end
#$ -m abe
#$ -M <email address>
#source ~/.bash_profile
## for this job, specifying the threaded environment with a "-" ensures the
## maximum number of processors is used
#$ -pe threaded -
echo "slots = $NSLOTS"
export OMP_NUM_THREADS=$NSLOTS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "Running on host=$HOSTNAME"
## memory resource request per thread, max 24 for 32 threads
#$ -l h_vmem=4G
##$ -V
## this environment variable setting is needed only for OpenMP-parallelized applications
## finally! -- run your process
<path>/OMPtest

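As a sanity check outside the scheduler, the thread count can also be capped by
hand before launching the binary (OMP_NUM_THREADS is the standard OpenMP
control; 32 is just an arbitrary test value):
[c++]$ export OMP_NUM_THREADS=32
[c++]$ <path>/OMPtest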
Since unlimited processors/slots have always crashed the machine, I've 
specified:
    #$ -pe threaded 139
Anything above 139 crashes the machine, yet there's no output in mcelog or 
/var/log/messages. Any insight into what could be happening would be greatly 
appreciated!


Scott Lucas
HPC Applications Support
208-776-0209
lucas.sco...@mayo.edu
