Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

Ralph Castain Thu, 14 Nov 2013 03:00:52 -0500 (EST)

Also, you need to tell mpirun that the nodes aren't the same - add 
--hetero-nodes to your cmd line



On Nov 13, 2013, at 10:14 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Thank you, Ralph!
> 
> I didn't know that cpus-per-proc worked that way.
> As far as I know, it didn't work like that in openmpi-1.6.x.
> It just bound each process to 4 cores...
> 
> I don't have much time today, so I'll check it tomorrow.
> And thank you again for looking into the oversubscription problem.
> 
> tmishima
> 
> 
>> Guess I don't see why modifying the allocation is required - we have
>> mapping options that should support such things. If you specify the total
>> number of procs you want, and cpus-per-proc=4, it should do the same thing
>> I would think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 slot
>> nodes, and up to 6 on the 64 slot nodes (since you specified np=16). So I
>> guess I don't understand the issue.
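
The split described above (2, 8, and up to 6 procs for np=16 with 4 cpus per proc) is plain integer arithmetic. A minimal sketch, assuming each node takes floor(slots / cpus-per-proc) ranks in turn until np runs out; procs_per_node is a hypothetical helper written for illustration, not an Open MPI command, and the real mapper may place ranks in a different order:

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI).
# Usage: procs_per_node NP CPUS_PER_PROC SLOTS...
# Prints how many ranks each node would receive, in the order given.
procs_per_node() {
    np=$1
    cpp=$2
    shift 2
    out=""
    for slots in "$@"; do
        cap=$((slots / cpp))            # ranks that fit on this node
        if [ "$cap" -gt "$np" ]; then   # never hand out more than remain
            cap=$np
        fi
        out="$out $cap"
        np=$((np - cap))
    done
    echo $out                           # unquoted on purpose: trims the leading space
}

# One node of each type from the thread: 8, 32 and 64 slots.
procs_per_node 16 4 8 32 64             # prints: 2 8 6
```
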
>> 
>> Regardless, if NPROCS=8 (and you verified that by printing it out, not
>> just assuming wc -l got that value), then it shouldn't think it is
>> oversubscribed. I'll take a look under a slurm allocation as that is all
>> I can access.
>> 
>> 
>> On Nov 13, 2013, at 7:23 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Our cluster consists of three types of nodes. They have 8, 32
>>> and 64 slots respectively. Since the performance of each core is
>>> almost the same, mixed use of these nodes is possible.
>>> 
>>> Furthermore, in this case, for a hybrid application with openmpi+openmp,
>>> the hostfile has to be modified as follows:
>>> 
>>> #PBS -l nodes=1:ppn=32+4:ppn=8
>>> export OMP_NUM_THREADS=4
>>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>>> 
>>> That's why I want to do that.
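
For illustration, the unspecified "modify" step in the script above could look like the awk filter below. This is only a sketch, assuming the nodefile lists each host once per slot and that one hostfile line is wanted per group of OMP_NUM_THREADS slots; condense_nodefile is a made-up name, not the poster's actual tool.

```shell
#!/bin/sh
# Hypothetical helper: keep one line out of every N lines per host name.
# Usage: condense_nodefile N < nodefile > hostfile
condense_nodefile() {
    # seen[host] counts earlier occurrences of this host; print the 1st,
    # the (N+1)th, the (2N+1)th, ... occurrence and drop the rest.
    awk -v n="$1" '(seen[$0]++ % n) == 0'
}

# Demo with a made-up 8-slot node: 8 lines are condensed to 2.
for i in 1 2 3 4 5 6 7 8; do
    echo node08
done | condense_nodefile 4
```

With the allocation from the script (one 32-slot node plus four 8-slot nodes, 64 lines in total) and N=4, this filter would leave 8 + 4*2 = 16 lines, matching the comment in the script.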
>>> 
>>> Of course, I know that if I give up mixed use, -npernode is better for
>>> this purpose.
>>> 
>>> (The script I showed you first is just a simplified one to clarify the
>>> problem.)
>>> 
>>> tmishima
>>> 
>>> 
>>>> Why do it the hard way? I'll look at the FAQ because that definitely
>>>> isn't a recommended thing to do - better to use -host to specify the
>>>> subset, or just specify the desired mapping using all the various
>>>> mappers we provide.
>>>> 
>>>> On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> Sorry for the cross-post.
>>>>> 
>>>>> The nodefile is very simple; it consists of 8 lines:
>>>>> 
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> node08
>>>>> 
>>>>> Therefore, NPROCS=8
>>>>> 
>>>>> My aim is to modify the allocation as you pointed out. According to the
>>>>> Open MPI FAQ, a proper subset of the hosts allocated to the Torque / PBS
>>>>> Pro job should be allowed.
>>>>> 
>>>>> tmishima
>>>>> 
>>>>>> Please - can you answer my question on script2? What is the value of
>>>>>> NPROCS?
>>>>>> 
>>>>>> Why would you want to do it this way? Are you planning to modify the
>>>>>> allocation?? That generally is a bad idea as it can confuse the system
>>>>>> 
>>>>>> 
>>>>>> On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Since what I really want is to run script2 correctly, please let us
>>>>>>> concentrate on script2.
>>>>>>> 
>>>>>>> I'm not an expert on the internals of openmpi. All I can do is observe
>>>>>>> it from the outside. I suspect these lines are strange, especially the
>>>>>>> last one.
>>>>>>> 
>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>> 
>>>>>>> These lines come from this part of orte_rmaps_base_get_target_nodes
>>>>>>> in rmaps_base_support_fns.c:
>>>>>>> 
>>>>>>>     } else if (node->slots <= node->slots_inuse &&
>>>>>>>                (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>>>>>>>         /* remove the node as fully used */
>>>>>>>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>>>>>>>                              "%s Removing node %s slots %d inuse %d",
>>>>>>>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>>>>>                              node->name, node->slots, node->slots_inuse));
>>>>>>>         opal_list_remove_item(allocated_nodes, item);
>>>>>>>         OBJ_RELEASE(item);  /* "un-retain" it */
>>>>>>> 
>>>>>>> I wonder why node->slots and node->slots_inuse are both 0, which I can
>>>>>>> read from the above line "Removing node node08 slots 0 inuse 0".
>>>>>>> 
>>>>>>> Or I'm not sure but
>>>>>>> "else if (node->slots <= node->slots_inuse &&" should be
>>>>>>> "else if (node->slots < node->slots_inuse &&" ?
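
The two comparisons can be checked by hand for the cases in this thread. The sketch below is not Open MPI code; node_is_full is a hypothetical shell stand-in for the "slots <= slots_inuse" test quoted above. It suggests that "<=" is the comparison a no-oversubscribe policy needs, because a node with 8 slots and 8 in use must be treated as full, and that the odd input here is the slot count of 0 rather than the operator.

```shell
#!/bin/sh
# Hypothetical stand-in for the quoted C condition:
#   node->slots <= node->slots_inuse
# Exit status 0 means the node would be removed as fully used.
node_is_full() {
    [ "$1" -le "$2" ]
}

# slots/inuse pairs: an empty node, a full node, and the case from the log.
for pair in "8 0" "8 8" "0 0"; do
    set -- $pair
    if node_is_full "$1" "$2"; then
        echo "slots=$1 inuse=$2: removed"
    else
        echo "slots=$1 inuse=$2: kept"
    fi
done
```

Switching to "<" would keep the "0 0" node, but it would also keep the "8 8" node and so let a ninth process land on a full node.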
>>>>>>> 
>>>>>>> tmishima
>>>>>>> 
>>>>>>>> On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Yes, node08 has 8 slots, but I am also running just 8 processes.
>>>>>>>>> 
>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>> 
>>>>>>>>> Therefore, I think it should allow this allocation. Is that right?
>>>>>>>> 
>>>>>>>> Correct
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> My question is why script1 works and script2 does not. They are
>>>>>>>>> almost the same.
>>>>>>>>> 
>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>> 
>>>>>>>>> #SCRIPT1
>>>>>>>>> mpirun -report-bindings -bind-to core Myprog
>>>>>>>>> 
>>>>>>>>> #SCRIPT2
>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core
>>>>>>>> 
>>>>>>>> This version is not only reading the PBS allocation, but also invoking
>>>>>>>> the hostfile filter on top of it. Different code path. I'll take a look -
>>>>>>>> it should still match up assuming NPROCS=8. Any possibility that it is a
>>>>>>>> different number? I don't recall, but aren't there some extra lines in
>>>>>>>> the nodefile - e.g., comments?
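
One way to rule out a miscount is to print the value and to count only lines that can name a host. A minimal sketch, assuming blank lines and lines starting with "#" should be ignored; whether a PBS nodefile ever contains such lines is exactly the open question above, and count_slots is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count nodefile lines that are neither blank
# nor comments. Usage: count_slots FILE
count_slots() {
    grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Demo on a made-up file: 3 host lines, 1 comment, 1 blank line.
tmp=$(mktemp)
printf 'node08\nnode08\n# comment\n\nnode08\n' > "$tmp"
echo "wc -l sees $(wc -l < "$tmp") lines, count_slots sees $(count_slots "$tmp")"
rm -f "$tmp"
```

In the job script this would replace the wc -l line with something like NPROCS=$(count_slots pbs_hosts), followed by echo "NPROCS=${NPROCS}" so the value shows up in the job output.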
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Myprog
>>>>>>>>> 
>>>>>>>>> tmishima
>>>>>>>>> 
>>>>>>>>>> I guess here's my confusion. If you are using only one node, and that
>>>>>>>>>> node has 8 allocated slots, then we will not allow you to run more than
>>>>>>>>>> 8 processes on that node unless you specifically provide the
>>>>>>>>>> --oversubscribe flag. This is because you are operating in a managed
>>>>>>>>>> environment (in this case, under Torque), and so we treat the
>>>>>>>>>> allocation as "mandatory" by default.
>>>>>>>>>> 
>>>>>>>>>> I suspect that is the issue here, in which case the system is behaving
>>>>>>>>>> as it should.
>>>>>>>>>> 
>>>>>>>>>> Is the above accurate?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Nov 13, 2013, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> It has nothing to do with LAMA as you aren't using that mapper.
>>>>>>>>>>> 
>>>>>>>>>>> How many nodes are in this allocation?
>>>>>>>>>>> 
>>>>>>>>>>> On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Ralph, here is some additional information.
>>>>>>>>>>>> 
>>>>>>>>>>>> Here is the main part of the output after adding "-mca rmaps_base_verbose 50".
>>>>>>>>>>>> 
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
>>>>>>>>>>>> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
>>>>>>>>>>>> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>>>>>>>>>>>> 
>>>>>>>>>>>> From this result, I guess it's related to oversubscription.
>>>>>>>>>>>> So I added "-oversubscribe" and reran it, and then it worked well, as
>>>>>>>>>>>> shown below:
>>>>>>>>>>>> 
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
>>>>>>>>>>>> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
>>>>>>>>>>>> [node08.cluster:27019]     node: node08 daemon: 0
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
>>>>>>>>>>>> [node08.cluster:27019] [[56774,0],0] Starting at node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
>>>>>>>>>>>> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>>>>>>>>>>>> 
>>>>>>>>>>>> I think something is wrong with the treatment of oversubscription,
>>>>>>>>>>>> which might be related to "#3893: LAMA mapper has problems".
>>>>>>>>>>>> 
>>>>>>>>>>>> tmishima
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hmmm...looks like we aren't getting your allocation. Can you rerun
>>>>>>>>>>>>> and add -mca ras_base_verbose 50?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here is the output of "-mca plm_base_verbose 5".
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [rsh]
>>>>>>>>>>>>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [slurm]
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Querying component [tm]
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Query of component [tm] set priority to 75
>>>>>>>>>>>>>> [node08.cluster:23573] mca:base:select:(  plm) Selected component [tm]
>>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
>>>>>>>>>>>>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>>>>>>>>>>>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>> All nodes which are allocated for this job are already filled.
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here, openmpi's configuration is as follows:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ./configure \
>>>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>>>> --disable-vt \
>>>>>>>>>>>>>> --enable-debug \
>>>>>>>>>>>>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>>>>>>>>>>>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>>>>>>>>>>>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Okay, I can help you. Please give me some time to report the
>>>>>>>>>>>>>>> output.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I can try, but I have no way of testing Torque any more - so all I
>>>>>>>>>>>>>>>> can do is a code review. If you can build --enable-debug and add
>>>>>>>>>>>>>>>> -mca plm_base_verbose 5 to your cmd line, I'd appreciate seeing the
>>>>>>>>>>>>>>>> output.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thank you for your quick response.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'd like to report one more regression in the Torque support of
>>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
>>>>>>>>>>>>>>>>> has problems", which I reported a few days ago.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
>>>>>>>>>>>>>>>>> although it worked with openmpi-1.7.3 as I told you before.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> #!/bin/sh
>>>>>>>>>>>>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>>>>>>>>>>>>> export OMP_NUM_THREADS=1
>>>>>>>>>>>>>>>>> cd $PBS_O_WORKDIR
>>>>>>>>>>>>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>>>>>>>>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>>>>>>>>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS} ", then it works fine.
>>>>>>>>>>>>>>>>> Since this happens without a lama request, I guess the problem is not
>>>>>>>>>>>>>>>>> in lama itself. Anyway, please look into this issue as well.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Done - thanks!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Dear openmpi developers,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
>>>>>>>>>>>>>>>>>>> built with PGI13.10, as shown below:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>>>>>>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>>>>>>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>>>>>>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>>>>>>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>>>>>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>>>>>>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>>>>>>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>>>>>>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>>>>>>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>>>>>>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
>>>>>>>>>>>>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>>>>>>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
>>>>>>>>>>>>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
>>>>>>>>>>>>>>>>>>> (gdb) quit
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c appears to be unnecessary,
>>>>>>>>>>>>>>>>>>> and it causes the segfault.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 624      /* lookup the corresponding process */
>>>>>>>>>>>>>>>>>>> 625      peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>>>>>>>>>>>>> 626      if (NULL == peer) {
>>>>>>>>>>>>>>>>>>> 627          ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>>> 628          opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
>>>>>>>>>>>>>>>>>>> 629                              "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>>>>>>>>>>>>> 630                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>>>>>>>>>>>>> 631          peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>>>>>>>>>>>> 632          peer->mod = mod;
>>>>>>>>>>>>>>>>>>> 633          peer->name = hdr->origin;
>>>>>>>>>>>>>>>>>>> 634          peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>>>>>>>>>>>>> 635          ui64 = (uint64_t*)(&peer->name);
>>>>>>>>>>>>>>>>>>> 636          if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>>>>>>>>>>>>> 637              OBJ_RELEASE(peer);
>>>>>>>>>>>>>>>>>>> 638              return;
>>>>>>>>>>>>>>>>>>> 639          }
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
