2009/3/31 Ralph Castain <r...@lanl.gov>:
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.

Thanks Ralph! I'm sorry my first post lacked specifics. I'll try my best to fill you in on as much debug info as I can.

So do we. In fact, on the very same cluster, other jobs using the same code run fine. It's only this one type of job that shows this strange behavior. For the curious, the code I am trying to run is a computational chemistry code called DACAPO, developed at CAMd at the Technical University of Denmark.

Link: https://wiki.fysik.dtu.dk/dacapo

Hardware: Dell PowerEdge SC1435 rack servers, 2.2GHz AMD Opterons, 8 CPUs per node.

> Are you getting an allocation from somewhere for the nodes? If so, are you
> using Moab to get it?

We are using Torque as the resource manager, with Maui as the scheduler.

> Do you have a $PBS_NODEFILE in your environment?

Yes, I do. For a test case I was trying to run on a single node (which has 8 CPUs). If I cat $PBS_NODEFILE, I get the name "node17" 8 times. I also dumped the environment variables from a running job and got:

PBS_NODEFILE="/var/spool/torque/aux//4609.uranus.che.foo.edu"

> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash? Are they segfaulting - if so, can
> you use gdb to find out where?

Yes, they are indeed segfaulting, and only when I run them through Torque:

########################################
forrtl: error (78): process killed (SIGTERM)
mpirun noticed that job rank 5 with PID 10580 on node node17 exited on signal 11 (Segmentation fault).
#########################################

The exact same job runs like a charm if I launch it with mpirun on the node outside of Torque.

I can try gdb, though I haven't used it much before. In case it matters, the executable is Fortran source compiled with the Intel Fortran compiler (ifort). That executable runs fine for all other cases except this one.

Maybe this helps more?

--
Rahul
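Since the nodefile is the one piece of state mpirun inherits from Torque, a quick sanity check inside the job script can rule out a slot-count mismatch. This is only a sketch: outside of a Torque job $PBS_NODEFILE is unset, so the script falls back to building a fake nodefile mimicking the 8-slot, single-node allocation described above ("node17" listed 8 times).

```shell
#!/bin/sh
# Hypothetical sanity check for the Torque allocation mpirun will see.
# $PBS_NODEFILE lists one hostname per allocated slot; for an 8-slot,
# single-node job it should contain the same host 8 times.
nodefile="${PBS_NODEFILE:-$(mktemp)}"   # fake file when run outside Torque
if [ ! -s "$nodefile" ]; then
    for i in 1 2 3 4 5 6 7 8; do echo node17; done > "$nodefile"
fi
# One line per slot: mpirun -np should not exceed this count.
echo "slots allocated: $(wc -l < "$nodefile")"
echo "distinct hosts : $(sort -u "$nodefile" | wc -l)"
```

To follow up on the gdb suggestion, the usual route is to enable core dumps in the job script with "ulimit -c unlimited" before mpirun, resubmit, then open the resulting core file with "gdb ./executable core" and type "bt" for a backtrace; with ifort, compiling with -g first makes the backtrace readable.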