Hi all,

I defined two parallel environments. mpifillamd:

----------------
pe_name            mpifillamd
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
----------------

and mpi48amd:

----------------
pe_name            mpi48amd
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    48
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
-----------------

The hosts have 48 cores each. My job script is:

-----
#!/bin/sh
#$ -S /bin/bash
#$ -N AMD_NS_100
#$ -cwd
#$ -l h_vmem=1.4G,excl=1 (request exclusive host)
#$ -j y
#$ -pe mpi48amd 96
mpirun --mca btl ^openib,sm -np $NSLOTS ./a.out
------------------

1) When I submit the job with mpifillamd, everything is OK. With the PE mpi48amd it fails, and multiple ".btr" files are created. Why?

One of the .btr files is:

----------
a.out:19402 terminated with signal 11 at PC=4636f1 SP=7ffff6536678. Backtrace:
./a.out(initial_comm_cell_+0x611)[0x4636f1]
./a.out(input_+0xfe7)[0x41a657]
-----------

The job output is:

--------------
....
TASK WIRH RANK 48 HASICMAXP = 96000
TASK WIRH RANK 49 HASICMAXP = 96000
TASK WIRH RANK 50 HASICMAXP = 96000
TASK WIRH RANK 51 HASICMAXP = 96000
a.out:19391 terminated with signal 11 at PC=4636f1 SP=7ffff9274878. Backtrace:
TASK WIRH RANK 52 HASICMAXP = 96000
TASK WIRH RANK 53 HASICMAXP = 96000
./a.out(initial_comm_cell_+0x611)[0x4636f1]
./a.out(input_+0xfe7)[0x41a657]
./a.out(MAIN__+0x313)[0x40b013]
./a.out(main+0x3c)[0x40acec]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x31aee1d994]
./a.out[0x40abf9]
...
--------------

2) In this case (hosts have 48 slots, the job requests 96 slots and exclusive hosts), do mpifillamd and mpi48amd behave any differently?

Thanks
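
P.S. To compare what the two PEs actually grant, a small check like the one below could go in the job script just before the mpirun line. It is only a sketch: it assumes the usual $PE_HOSTFILE layout (host name, slot count, queue, processor range, one host per line), and the function name summarize_pe_hostfile is made up for illustration.

```shell
# Print one "host slots" line per granted host, then the total slot count.
# Assumption: column 1 of the hostfile is the host name, column 2 the slots.
summarize_pe_hostfile() {
    awk '{ printf "%s %d\n", $1, $2; total += $2 }
         END { printf "total %d\n", total }' "$1"
}

# Grid Engine sets PE_HOSTFILE for parallel jobs; skip quietly otherwise.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    summarize_pe_hostfile "$PE_HOSTFILE"
fi
```

Running the same job once with each PE and comparing the two summaries would show whether the allocations really differ, and whether the total matches $NSLOTS.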
