Hi Ralph,
Here is the output of "-mca plm_base_verbose 5":

[node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
[node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
[node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
[node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
[node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
[node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
[node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
[node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
[node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
[node08.cluster:23573] [[59480,0],0] plm:base:setup_job
[node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
[node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
[node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

Here, Open MPI's configuration is as follows:

./configure \
  --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
  --with-tm \
  --with-verbs \
  --disable-ipv6 \
  --disable-vt \
  --enable-debug \
  CC=pgcc CFLAGS="-tp k8-64e" \
  CXX=pgCC CXXFLAGS="-tp k8-64e" \
  F77=pgfortran FFLAGS="-tp k8-64e" \
  FC=pgfortran FCFLAGS="-tp k8-64e"

> Hi Ralph,
>
> Okay, I can help you. Please give me some time to report the output.
>
> Tetsuya Mishima
>
> > I can try, but I have no way of testing Torque any more - so all I can
> > do is a code review.
> > If you can build --enable-debug and add -mca plm_base_verbose 5 to your
> > cmd line, I'd appreciate seeing the output.
> >
> > On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > > Hi Ralph,
> > >
> > > Thank you for your quick response.
> > >
> > > I'd like to report one more regression in the Torque support of
> > > openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> > > has problems", which I reported a few days ago.
> > >
> > > The script below does not work with openmpi-1.7.4a1r29646,
> > > although it worked with openmpi-1.7.3 as I told you before.
> > >
> > > #!/bin/sh
> > > #PBS -l nodes=node08:ppn=8
> > > export OMP_NUM_THREADS=1
> > > cd $PBS_O_WORKDIR
> > > cp $PBS_NODEFILE pbs_hosts
> > > NPROCS=`wc -l < pbs_hosts`
> > > mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> > >
> > > If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> > > Since this happens without a LAMA request, I guess the problem is not
> > > in LAMA itself. Anyway, please look into this issue as well.
> > >
> > > Regards,
> > > Tetsuya Mishima
> > >
> > >> Done - thanks!
> > >>
> > >> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>
> > >>> Dear openmpi developers,
> > >>>
> > >>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646
> > >>> built by PGI 13.10, as shown below:
> > >>>
> > >>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> > >>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> > >>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> > >>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> > >>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> > >>> [manage:23082] *** Process received signal ***
> > >>> [manage:23082] Signal: Segmentation fault (11)
> > >>> [manage:23082] Signal code: Address not mapped (1)
> > >>> [manage:23082] Failing at address: 0x34
> > >>> [manage:23082] *** End of error message ***
> > >>> Segmentation fault (core dumped)
> > >>>
> > >>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> > >>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> > >>> Copyright (C) 2009 Free Software Foundation, Inc.
> > >>> ...
> > >>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> > >>> Program terminated with signal 11, Segmentation fault.
> > >>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> > >>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> > >>> (gdb) where
> > >>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> > >>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> > >>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> > >>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> > >>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> > >>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> > >>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> > >>> (gdb) quit
> > >>>
> > >>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary; it
> > >>> dereferences the still-NULL peer, which
> > >>> causes the segfault.
> > >>>
> > >>> 624         /* lookup the corresponding process */
> > >>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> > >>> 626         if (NULL == peer) {
> > >>> 627             ui64 = (uint64_t*)(&peer->name);
> > >>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
> > >>>                                     orte_oob_base_framework.framework_output,
> > >>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
> > >>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> > >>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> > >>> 632             peer->mod = mod;
> > >>> 633             peer->name = hdr->origin;
> > >>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
> > >>> 635             ui64 = (uint64_t*)(&peer->name);
> > >>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> > >>> 637                 OBJ_RELEASE(peer);
> > >>> 638                 return;
> > >>> 639             }
> > >>>
> > >>> Please fix this mistake in the next release.
> > >>>
> > >>> Regards,
> > >>> Tetsuya Mishima
> > >>>
> > >>> _______________________________________________
> > >>> users mailing list
> > >>> us...@open-mpi.org
> > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users