Also, you need to tell mpirun that the nodes aren't the same - add --hetero-nodes to your cmd line
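For concreteness, here is a minimal sketch of how that might look for the mixed-node hybrid job described further down in this thread. The awk one-liner is only an illustration of the "modify $PBS_NODEFILE" condensation step (the actual script used there is not shown) and assumes Torque lists each node once per slot with repeated hostnames grouped together; the other options are taken from the thread.

#!/bin/sh
#PBS -l nodes=1:ppn=32+4:ppn=8
export OMP_NUM_THREADS=4
cd $PBS_O_WORKDIR

# Keep every OMP_NUM_THREADS-th line of the nodefile, so each remaining
# entry stands for one MPI rank that will own 4 cores (64 lines -> 16 lines).
awk -v n=$OMP_NUM_THREADS 'NR % n == 1' $PBS_NODEFILE > pbs_hosts
NPROCS=`wc -l < pbs_hosts`

# --hetero-nodes tells mpirun that the nodes in the allocation are not
# identical (8-, 32- and 64-slot nodes in this cluster).
mpirun --hetero-nodes -hostfile pbs_hosts -np ${NPROCS} -cpus-per-proc 4 \
       -x OMP_NUM_THREADS -report-bindings Myprog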
On Nov 13, 2013, at 10:14 PM, tmish...@jcity.maeda.co.jp wrote:

> Thank you, Ralph!
>
> I didn't know that function of cpus-per-proc. As far as I know, it didn't work like that in openmpi-1.6.x - it was just 4-core binding...
>
> Today I don't have much time, so I'll check it tomorrow. And thank you again for checking the oversubscription problem.
>
> tmishima
>
>> Guess I don't see why modifying the allocation is required - we have mapping options that should support such things. If you specify the total number of procs you want, and cpus-per-proc=4, it should do the same thing I would think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 slot nodes, and up to 6 on the 64 slot nodes (since you specified np=16). So I guess I don't understand the issue.
>>
>> Regardless, if NPROCS=8 (and you verified that by printing it out, not just assuming wc -l got that value), then it shouldn't think it is oversubscribed. I'll take a look under a slurm allocation, as that is all I can access.
>>
>> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Our cluster consists of three types of nodes. They have 8, 32 and 64 slots respectively. Since the performance of each core is almost the same, mixed use of these nodes is possible.
>>>
>>> Furthermore, in this case, for a hybrid application with openmpi+openmp, modification of the hostfile is necessary as follows:
>>>
>>> #PBS -l nodes=1:ppn=32+4:ppn=8
>>> export OMP_NUM_THREADS=4
>>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>>>
>>> That's why I want to do that.
>>>
>>> Of course I know that if I give up mixed use, -npernode is better for this purpose.
>>>
>>> (The script I showed you first is just a simplified one to clarify the problem.)
>>>
>>> tmishima
>>>
>>>> Why do it the hard way? I'll look at the FAQ because that definitely isn't a recommended thing to do - better to use -host to specify the subset, or just specify the desired mapping using all the various mappers we provide.
>>>>
>>>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Sorry for the cross-post.
>>>>>
>>>>> The nodefile is very simple; it consists of 8 lines:
>>>>>
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>>
>>>>> Therefore, NPROCS=8.
>>>>>
>>>>> My aim is to modify the allocation, as you pointed out. According to the Open MPI FAQ, using a proper subset of the hosts allocated to the Torque / PBS Pro job should be allowed.
>>>>>
>>>>> tmishima
>>>>>
>>>>>> Please - can you answer my question on script2? What is the value of NPROCS?
>>>>>>
>>>>>> Why would you want to do it this way? Are you planning to modify the allocation?? That generally is a bad idea as it can confuse the system.
>>>>>>
>>>>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>
>>>>>>> Since what I really want is to run script2 correctly, please let us concentrate on script2.
>>>>>>>
>>>>>>> I'm not an expert on the internals of openmpi. What I can do is just observation from the outside. I suspect these lines are strange, especially the last one:
>>>>>>>
>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>>
>>>>>>> These lines come from this part of orte_rmaps_base_get_target_nodes in rmaps_base_support_fns.c:
>>>>>>>
>>>>>>> } else if (node->slots <= node->slots_inuse &&
>>>>>>>            (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>>>>>>>     /* remove the node as fully used */
>>>>>>>     OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>>>>>>>                          "%s Removing node %s slots %d inuse %d",
>>>>>>>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>>>>>                          node->name, node->slots, node->slots_inuse));
>>>>>>>     opal_list_remove_item(allocated_nodes, item);
>>>>>>>     OBJ_RELEASE(item);  /* "un-retain" it */
>>>>>>>
>>>>>>> I wonder why node->slots and node->slots_inuse are both 0, which I can read from the line "Removing node node08 slots 0 inuse 0" above.
>>>>>>>
>>>>>>> Also, I'm not sure, but should "else if (node->slots <= node->slots_inuse &&" perhaps be "else if (node->slots < node->slots_inuse &&"?
>>>>>>>
>>>>>>> tmishima
>>>>>>>
>>>>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>
>>>>>>>>> Yes, node08 has 8 slots, but the number of processes I run is also 8.
>>>>>>>>>
>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>>
>>>>>>>>> Therefore, I think it should allow this allocation. Is that right?
>>>>>>>>
>>>>>>>> Correct
>>>>>>>>
>>>>>>>>> My question is why script1 works and script2 does not. They are almost the same.
>>>>>>>>>
>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>>
>>>>>>>>> #SCRIPT1
>>>>>>>>> mpirun -report-bindings -bind-to core Myprog
>>>>>>>>>
>>>>>>>>> #SCRIPT2
>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
>>>>>>>>
>>>>>>>> This version is not only reading the PBS allocation, but also invoking the hostfile filter on top of it. Different code path. I'll take a look - it should still match up assuming NPROCS=8. Any possibility that it is a different number? I don't recall, but aren't there some extra lines in the nodefile - e.g., comments?
>>>>>>>>
>>>>>>>>> Myprog
>>>>>>>>>
>>>>>>>>> tmishima
>>>>>>>>>
>>>>>>>>>> I guess here's my confusion. If you are using only one node, and that node has 8 allocated slots, then we will not allow you to run more than 8 processes on that node unless you specifically provide the --oversubscribe flag. This is because you are operating in a managed environment (in this case, under Torque), and so we treat the allocation as "mandatory" by default.
>>>>>>>>>>
>>>>>>>>>> I suspect that is the issue here, in which case the system is behaving as it should.
>>>>>>>>>>
>>>>>>>>>> Is the above accurate?
>>>>>>>>>>
>>>>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> It has nothing to do with LAMA as you aren't using that mapper.
>>>>>>>>>>>
>>>>>>>>>>> How many nodes are in this allocation?
>>>>>>>>>>>
>>>>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph, this is some additional information.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50":
>>>>>>>>>>>>
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>>>>>>>
>>>>>>>>>>>> From this result, I guess it's related to oversubscription.
>>>>>>>>>>>> So I added "-oversubscribe" and reran, and then it worked well, as shown below:
>>>>>>>>>>>>
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
>>>>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
>>>>>>>>>>>> [node08.cluster:27019] node: node08 daemon: 0
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>>>>>>>>>>>>
>>>>>>>>>>>> I think something is wrong with the treatment of oversubscription, which might be related to "#3893: LAMA mapper has problems".
>>>>>>>>>>>>
>>>>>>>>>>>> tmishima
>>>>>>>>>>>>
>>>>>>>>>>>>> Hmmm...looks like we aren't getting your allocation. Can you rerun and add -mca ras_base_verbose 50?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm].
>>>>>>>>>>>>>> Query failed to return a module
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
>>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
>>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>> All nodes which are allocated for this job are already filled.
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here, openmpi's configuration is as follows:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ./configure \
>>>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>>>> --disable-vt \
>>>>>>>>>>>>>> --enable-debug \
>>>>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>>>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>>>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the output.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so all I can do is a code review. If you can build --enable-debug and add -mca plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for your quick response.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd like to report one more regression in the Torque support of openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper has problems", which I reported a few days ago.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646, although it worked with openmpi-1.7.3, as I told you before.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> #!/bin/sh
>>>>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine. Since this happens without a lama request, I guess the problem is not in lama itself. Anyway, please look into this issue as well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Done - thanks!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Dear openmpi developers,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646 built by PGI 13.10, as shown below:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
>>>>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>>>>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary (peer is still NULL at that point), and it causes the segfault:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 624         /* lookup the corresponding process */
>>>>>>>>>>>>>>>>>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>>>>>>>>>>>>> 626         if (NULL == peer) {
>>>>>>>>>>>>>>>>>>> 627             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>>>>>>>>>>>>>>>>>                                     orte_oob_base_framework.framework_output,
>>>>>>>>>>>>>>>>>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>>>>>>>>>>>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>>> 632             peer->mod = mod;
>>>>>>>>>>>>>>>>>>> 633             peer->name = hdr->origin;
>>>>>>>>>>>>>>>>>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>>>>>>>>>>>>> 635             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>>>>>>>>>>>>> 637                 OBJ_RELEASE(peer);
>>>>>>>>>>>>>>>>>>> 638                 return;
>>>>>>>>>>>>>>>>>>> 639             }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Tetsuya Mishima
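As an aside on the NPROCS question raised earlier in this thread (whether the nodefile might contain extra lines such as comments): one quick check of the copied hostfile before trusting wc -l might look like the following sketch, which is only an illustration and not part of the original scripts.

# Compare the raw line count (what NPROCS is computed from) with the
# number of lines that actually contain a hostname.
wc -l < pbs_hosts
grep -v '^#' pbs_hosts | grep -c '[^[:space:]]'

If the two numbers differ, NPROCS over-counts the available slots and mpirun could legitimately report the node as oversubscribed.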