Agreed that the original program had the char*[20]/char[20] bug, but his segv is occurring before trying to use that array. So it's a bug - but he just hadn't hit it yet. :-)
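For completeness, a corrected version of the test program along the lines of Brent's and German's suggestions (a plain char buffer instead of an array of pointers, plus explicit null termination) would look roughly like this -- a minimal sketch, with the 256-byte buffer size taken from Brent's mail and the <unistd.h> include added for gethostname():

#include "mpi.h"
#include <stdio.h>
#include <unistd.h>   /* for gethostname() */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    char name[256];            /* char array, not char* name[20] (an array of pointers) */
    int maxlen = sizeof(name);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    gethostname(name, maxlen);
    name[maxlen - 1] = '\0';   /* gethostname() may not null-terminate on truncation */

    printf("Hello, world. I am %d of %d and host %s\n", rank, nprocs, name);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}

That only addresses the gethostname() bug, though; the segv at MPI_Barrier is a separate problem.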
I'd still like to see a debugging version so that we can get a real stack trace, and/or try the latest 1.4.4 RC (posted yesterday).

On Sep 27, 2011, at 3:08 PM, German Hoecht wrote:

> char* name[20]; yields 20 (undefined) pointers to char, guess you mean
> char name[20];
>
> So Brent's suggestion should work as well(?)
>
> To be safe I would also add:
> gethostname(name,maxlen);
> name[19] = '\0';
> printf("Hello, world. I am %d of %d and host %s \n", rank, ...
>
> Cheers
>
> On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:
>> Thanks, but my main concern is the segfault :P I changed it and, as I
>> expected, it still segfaults.
>>
>> On 9/27/11 9:48 AM, Henderson, Brent wrote:
>>> Here is another possibly non-helpful suggestion. :) Change:
>>>
>>> char* name[20];
>>> int maxlen = 20;
>>>
>>> To:
>>>
>>> char name[256];
>>> int maxlen = 256;
>>>
>>> gethostname() is supposed to properly truncate the hostname it returns
>>> if the actual name is longer than the length provided, but since you
>>> have at least one that is longer than 20 characters, I'm curious.
>>>
>>> Brent
>>>
>>>
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>>> On Behalf Of Jeff Squyres
>>> Sent: Tuesday, September 27, 2011 6:29 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Segfault on any MPI communication on head node
>>>
>>> Hmm. It's not immediately clear to me what's going wrong here.
>>>
>>> I hate to ask, but could you install a debugging version of Open MPI
>>> and capture a proper stack trace of the segv?
>>>
>>> Also, could you try the 1.4.4 rc and see if that magically fixes the
>>> problem? (I'm about to post a new 1.4.4 rc later this morning, but
>>> either the current one or the one from later today would be a good
>>> datapoint.)
>>>
>>>
>>> On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
>>>
>>>> Yep, Fedora Core 14 and OpenMPI 1.4.3
>>>>
>>>> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>>>>> Are you running the same OS version and Open MPI version between the
>>>>> head node and regular nodes?
>>>>>
>>>>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>>>>>
>>>>>> Hey all,
>>>>>> I've been racking my brains over this for several days and was
>>>>>> hoping anyone could enlighten me. I'll describe only the relevant
>>>>>> parts of the network/computer systems. There is one head node and a
>>>>>> multitude of regular nodes. The regular nodes are all identical to
>>>>>> each other. If I run an MPI program from one of the regular nodes
>>>>>> to any other regular node, everything works. If I include the head
>>>>>> node in the hosts file, I get segfaults, which I'll paste below
>>>>>> along with sample code. The machines are all networked via
>>>>>> InfiniBand and Ethernet. The issue only arises when MPI
>>>>>> communication occurs. By this I mean, MPI_Init might succeed but
>>>>>> the segfault always occurs on MPI_Barrier or MPI_Send/Recv. I found
>>>>>> a workaround by disabling the openib btl and enforcing that
>>>>>> communications go over InfiniBand (if I don't force InfiniBand,
>>>>>> it'll go over Ethernet).
>>>>>> This command works when the head node is included in the hosts file:
>>>>>> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>>>>>>
>>>>>> Sample Code:
>>>>>> #include "mpi.h"
>>>>>> #include <stdio.h>
>>>>>> int main(int argc, char *argv[])
>>>>>> {
>>>>>>     int rank, nprocs;
>>>>>>     char* name[20];
>>>>>>     int maxlen = 20;
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>>     gethostname(name, maxlen);
>>>>>>     printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs, name);
>>>>>>     fflush(stdout);
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }
>>>>>>
>>>>>> Segfault:
>>>>>> [pastec:19917] *** Process received signal ***
>>>>>> [pastec:19917] Signal: Segmentation fault (11)
>>>>>> [pastec:19917] Signal code: Address not mapped (1)
>>>>>> [pastec:19917] Failing at address: 0x8
>>>>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>>>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
>>>>>> [pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
>>>>>> [pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
>>>>>> [pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
>>>>>> [pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
>>>>>> [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
>>>>>> [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
>>>>>> [pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
>>>>>> [pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
>>>>>> [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>>>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>>>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
>>>>>> [pastec:19917] [13] ./b.out() [0x400919]
>>>>>> [pastec:19917] *** End of error message ***
>>>>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that process rank 1 with PID 19917 on node
>>>>>> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>>>>> --------------------------------------------------------------------------

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/