[OMPI users] SIGSEGV in OMPI 1.6.x
Hi,

While debugging a mysterious crash of a code, I was able to trace it down to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in opal/mca/memory/linux/malloc.c. Please see the following gdb log.

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000) at malloc.c:4385
4385        nextsize = chunksize(nextchunk);
(gdb) l
4380        Consolidate other non-mmapped chunks as they arrive.
4381      */
4382
4383      else if (!chunk_is_mmapped(p)) {
4384        nextchunk = chunk_at_offset(p, size);
4385        nextsize = chunksize(nextchunk);
4386        assert(nextsize > 0);
4387
4388        /* consolidate backward */
4389        if (!prev_inuse(p)) {
(gdb) bt
#0  opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000) at malloc.c:4385
#1  0x2ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637) at malloc.c:3511
#2  0x2ae6b18ea736 in opal_memory_linux_free_hook (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
#3  0x01412fcc in for_dealloc_allocatable ()
#4  0x007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647, name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0) at alloc.F90:1357
#5  0x0082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5, na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff, lasto=..., iphorb=..., numd=..., listdptr=..., listd=..., numh=..., listhptr=..., listh=..., nspin=@0xcf4ff0002, dscf=..., eldau=@0x0, deldau=@0x0, fa=..., stress=..., h=..., first=@0x0, last=@0x0) at ldau.F:752
#6  0x006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
#7  0x0070e257 in M_SIESTA_FORCES::siesta_forces (istep=@0xf9a4d070) at siesta_forces.F:90
#8  0x0070e475 in siesta () at siesta.F:23
#9  0x0045e47c in main ()

Can anybody shed some light here on what could be wrong?

Thanks,

Yong Qin
Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
Hi Shiqing,

I have solved the problem with the double quotes in OPENMPI_HOME, but there is still something wrong.

set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"

mpicc init_finalize.c
Cannot open configuration file "c:\Program Files (x86)\openmpi-1.6.1"/share/openmpi\mpicc-wrapper-data.txt
Error parsing data file mpicc: Not found

Everything is OK if you remove the double quotes which Windows automatically adds.

set OPENMPI_HOME=c:\Program Files (x86)\openmpi-1.6.1

mpicc init_finalize.c
Microsoft (R) 32-Bit C/C++-Optimierungscompiler Version 16.00.40219.01 für 80x86
...

mpiexec init_finalize.exe
--------------------------------------------------------------------------
WARNING: An invalid value was given for btl_tcp_if_exclude.
This value will be ignored.

  Local host: hermes
  Value:      127.0.0.1/8
  Message:    Did not find interface matching this subnet
--------------------------------------------------------------------------
Hello!

I get the output from my program, but also a warning from Open MPI. The new value for the loopback device was introduced a short time ago when I had problems with the loopback device on Solaris (it used "lo0" instead of your default "lo"). How can I avoid this message?

The 64-bit version of my program still hangs.

Kind regards

Siegmar


> > Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
> > This env is a backup option for the registry.
>
> It solves one problem but there is a new problem now :-((
>
> Without OPENMPI_HOME: Wrong pathname to help files.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>     invalid if_inexclude
> But I couldn't open the help file:
>     D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>     No such file or directory.  Sorry!
> --------------------------------------------------------------------------
> ...
>
> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
> the pathname contains the character " in the wrong place, so that it
> couldn't find the available help file.
>
> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>     no-hostfile
> But I couldn't open the help file:
>     "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt:
>     Invalid argument.  Sorry!
> --------------------------------------------------------------------------
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\ras\base\ras_base_allocate.c at line 200
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\plm\base\plm_base_launch_support.c at line 99
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\plm\process\plm_process_module.c at line 996
>
> It looks like the environment variable can also solve my
> problem in the 64-bit environment.
>
> D:\g...\prog\mpi\small_prog>mpicc init_finalize.c
>
> Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 für x64
> ...
>
> The process hangs without OPENMPI_HOME.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> ^C
>
> With OPENMPI_HOME:
>
> set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>     no-hostfile
> But I couldn't open the help file:
>     "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt:
>     Invalid argument.  Sorry!
> --------------------------------------------------------------------------
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\ras\base\ras_base_allocate.c at line 200
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\plm\base\plm_base_launch_support.c at line 99
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\plm\process\plm_process_module.c at line 996
>
> At least the program doesn't block any longer. Do you have any ideas
> how this new problem can be solved?
>
> Kind regards
>
> Siegmar
>
>
> > On 2012-09-05 1:02 PM, Siegmar Gross wrote:
> > > Hi Shiqing,
> > >
> > > D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> > > -
> > > Sorry!  You were supposed to get help about:
> > >     invalid if_inexclude
> > > But I couldn't open the help file:
> > >     D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
Re: [OMPI users] Infiniband performance Problem and stalling
On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - still not clear what is "Melanox III HCA 10G card".
Could you run "ibstat" and post the results?

What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?

Also, please see below.

> No I haven't used 1.6. I was trying to stick with the standards on the
> mellanox disk.
> Is there a known problem with 1.4.3?
>
> --
> *From:* Yevgeny Kliteynik
> *To:* Randolph Pullen ; Open MPI Users
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
>
> Randolph,
>
> Some clarification on the setup:
>
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to
> Ethernet? That is, when you're using openib BTL, you mean RoCE, right?
>
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
>
> -- YK
>
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
> > (reposted with consolidated information)
> > I have a test rig comprising 2 i7 systems with 8GB RAM and Melanox III HCA 10G cards,
> > running Centos 5.7, kernel 2.6.18-274
> > Open MPI 1.4.3
> > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2)
> > on a Cisco 24-port switch.
> >
> > Normal performance is:
> > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
> > results in:
> > Max rate = 958.388867 MB/sec  Min latency = 4.529953 usec
> > and:
> > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
> > Max rate = 653.547293 MB/sec  Min latency = 19.550323 usec
> >
> > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes, which seems fine.
> > log_num_mtt = 20 and log_mtts_per_seg = 2
> >
> > My application exchanges about a gig of data between the processes, with 2
> > sender and 2 consumer processes on each node plus 1 additional controller
> > process on the starting node.
> > The program splits the data into 64K blocks and uses non-blocking sends
> > and receives with busy/sleep loops to monitor progress until completion.
> > Each process owns a single buffer for these 64K blocks.
> >
> > My problem is I see better performance under IPoIB than I do on native IB
> > (RDMA_CM). My understanding is that IPoIB is limited to about 1G/s, so I am
> > at a loss to know why it is faster.
> >
> > These 2 configurations are equivalent (about 8-10 seconds per cycle):
> > mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl
> > tcp,self -H vh2,vh1 -np 9 --bycore prog
> > mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl
> > tcp,self -H vh2,vh1 -np 9 --bycore prog

When you say "--mca btl tcp,self", it means that the openib btl is not enabled.
Hence "--mca btl_openib_flags" is irrelevant.

> > And this one produces similar run times but seems to degrade with repeated
> > cycles:
> > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl
> > openib,self -H vh2,vh1 -np 9 --bycore prog

You're running 9 ranks on two machines, but you're using IB for intra-node
communication. Is that intentional? If not, you can add the "sm" btl and have
performance improved.

-- YK

> > Other btl_openib_flags settings result in much lower performance.
> > Changing the first of the above configs to use openib results in a 21
> > second run time at best. Sometimes it takes up to 5 minutes.
> > In all cases, openib runs in twice the time it takes TCP, except if I push
> > the small message max to 64K and force short messages. Then the openib times
> > are the same as TCP and no faster.
> >
> > With openib:
> > - Repeated cycles during a single run seem to slow down with each cycle
> >   (usually by about 10 seconds).
> > - On occasions it seems to stall indefinitely, waiting on a single receive.
> >
> > I'm still at a loss as to why. I can't find any errors logged during the runs.
> >
> > Any ideas appreciated.
> >
> > Thanks in advance,
> >
> > Randolph
Re: [OMPI users] SIGSEGV in OMPI 1.6.x
If you run into a segv in this code, it almost certainly means that you have
heap corruption somewhere. FWIW, that has *always* been what it meant when
I've run into segv's in any code under opal/mca/memory/linux/. Meaning: my
user code did something wrong, it created heap corruption, and then later some
malloc() or free() caused a segv in this area of the code.

This code is the same ptmalloc memory allocator that has shipped in glibc for
years. I'd be hard-pressed to say that any code is 100% bug free :-), but I'd
be surprised if there is a bug in this particular chunk of code.

I'd run your code through valgrind or some other memory-checking debugger and
see if that can shed any light on what's going on.

On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:

> Hi,
>
> While debugging a mysterious crash of a code, I was able to trace it down
> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>
> (gdb) c
> Continuing.
>
> Program received signal SIGSEGV, Segmentation fault.
> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
> at malloc.c:4385
> 4385        nextsize = chunksize(nextchunk);
> (gdb) l
> 4380        Consolidate other non-mmapped chunks as they arrive.
> 4381      */
> 4382
> 4383      else if (!chunk_is_mmapped(p)) {
> 4384        nextchunk = chunk_at_offset(p, size);
> 4385        nextsize = chunksize(nextchunk);
> 4386        assert(nextsize > 0);
> 4387
> 4388        /* consolidate backward */
> 4389        if (!prev_inuse(p)) {
> (gdb) bt
> #0  opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000) at malloc.c:4385
> #1  0x2ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637) at malloc.c:3511
> #2  0x2ae6b18ea736 in opal_memory_linux_free_hook (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
> #3  0x01412fcc in for_dealloc_allocatable ()
> #4  0x007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647, name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0) at alloc.F90:1357
> #5  0x0082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5, na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff, lasto=..., iphorb=..., numd=..., listdptr=..., listd=..., numh=..., listhptr=..., listh=..., nspin=@0xcf4ff0002, dscf=..., eldau=@0x0, deldau=@0x0, fa=..., stress=..., h=..., first=@0x0, last=@0x0) at ldau.F:752
> #6  0x006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
> #7  0x0070e257 in M_SIESTA_FORCES::siesta_forces (istep=@0xf9a4d070) at siesta_forces.F:90
> #8  0x0070e475 in siesta () at siesta.F:23
> #9  0x0045e47c in main ()
>
> Can anybody shed some light here on what could be wrong?
>
> Thanks,
>
> Yong Qin

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
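As an illustration of the failure mode Jeff describes, the following deliberately broken toy C program (it is not taken from SIESTA or Open MPI) corrupts the heap at one line but only crashes later, inside free(), in the same region of the allocator that appears in the backtrace above:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *a = malloc(16);
    char *b = malloc(16);   /* a neighbouring chunk on the heap */

    /* Heap corruption: this writes far past the end of 'a' and clobbers
     * the allocator's bookkeeping for the chunk that holds 'b'.
     * Nothing fails at this point; the damage is silent. */
    memset(a, 'x', 64);

    /* The corruption is typically detected only later, when the allocator
     * walks its chunk headers inside free(), which is the same place the
     * backtrace above lands (opal_memory_ptmalloc2_int_free). */
    free(b);
    free(a);
    return 0;
}

Under valgrind (for an MPI application, something like "mpirun -np 2 valgrind ./app"), the invalid writes are reported at the memset() line rather than at the eventual free(), which usually points straight at the offending user code.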
Re: [OMPI users] Regarding the Pthreads
Your question is somewhat outside the scope of this list. Perhaps people may
chime in with some suggestions, but that's more of a threading question than
an MPI question.

Be warned that you need to call MPI_Init_thread (not MPI_Init) with
MPI_THREAD_MULTIPLE in order to get true multi-threaded support in Open MPI.
And we only support that on the TCP and shared memory transports, and only if
you built Open MPI with threading support enabled.

On Sep 5, 2012, at 2:23 PM, seshendra seshu wrote:

> Hi,
> I am learning pthreads and trying to implement them in my quicksort program.
> My problem is that I am unable to understand how to implement pthreads on
> the data received at a node from the master. In detail: in my program the
> master divides the data and sends it to the slaves, and each slave sorts the
> received data independently and sends it back to the master once the sorting
> is done. I am now having a problem implementing pthreads on the slaves, i.e.,
> how to use pthreads to share the data among the cores in each slave, sort it,
> and send it back to the master.
> Could anyone help with this problem by providing some suggestions and clues?
>
> Thank you very much.
>
> --
> WITH REGARDS
> M.L.N.Seshendra

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
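For the MPI side of such a hybrid pthreads/MPI slave, the initialization Jeff mentions looks roughly like the minimal C sketch below. The error handling and the actual threading/sorting code are omitted; only the MPI_Init_thread pattern is the point here.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Ask for MPI_THREAD_MULTIPLE instead of calling MPI_Init(). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* The library may legally return less than what was requested
     * (for example, if this Open MPI build was made without threading
     * support), so always check 'provided' before letting several
     * threads make MPI calls concurrently. */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... spawn pthreads, receive data from the master, sort, send back ... */

    MPI_Finalize();
    return 0;
}

If only one thread per process (for example the thread that talks to the master) will ever make MPI calls, requesting MPI_THREAD_FUNNELED instead is usually sufficient and more widely supported.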
Re: [OMPI users] python-mrmpi() failed
On Sep 4, 2012, at 3:09 PM, mariana Vargas wrote:

> I am new in this. I have some codes that use MPI for Python, and I
> just installed (openmpi, mrmpi, mpi4py) in my home (from a cluster
> account) without apparent errors. I tried to perform this simple
> test in python and I get the following error related to openmpi.
> Could you help me figure out what is going on? I attach as much
> information as possible...

I think I know what's happening here. It's a complicated linker issue that
we've discussed before -- I'm not sure whether it was on this users list or
the OMPI developers list.

The short version is that you should remove your prior Open MPI installation,
and then rebuild Open MPI with the --disable-dlopen configure switch. See if
that fixes the problem.

> Thanks.
>
> Mariana
>
>
> From a python console
>
> >>> from mrmpi import mrmpi
> >>> mr=mrmpi()
> [ferrari:23417] mca: base: component_find: unable to open /home/mvargas/lib/openmpi/mca_paffinity_hwloc: /home/mvargas/lib/openmpi/mca_paffinity_hwloc.so: undefined symbol: opal_hwloc_topology (ignored)
> [ferrari:23417] mca: base: component_find: unable to open /home/mvargas/lib/openmpi/mca_carto_auto_detect: /home/mvargas/lib/openmpi/mca_carto_auto_detect.so: undefined symbol: opal_carto_base_graph_get_host_graph_fn (ignored)
> [ferrari:23417] mca: base: component_find: unable to open /home/mvargas/lib/openmpi/mca_carto_file: /home/mvargas/lib/openmpi/mca_carto_file.so: undefined symbol: opal_carto_base_graph_get_host_graph_fn (ignored)
> [ferrari:23417] mca: base: component_find: unable to open /home/mvargas/lib/openmpi/mca_shmem_mmap: /home/mvargas/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
> [ferrari:23417] mca: base: component_find: unable to open /home/mvargas/lib/openmpi/mca_shmem_posix: /home/mvargas/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_show_help (ignored)
> [ferrari:23417] mca: base: component_find: unable to open /home/mvargas/lib/openmpi/mca_shmem_sysv: /home/mvargas/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_shmem_base_select failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [ferrari:23417] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 79
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: orte_init failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [ferrari:23417] Local abort before MPI_INIT completed successfully;
> not able to aggregate error messages, and not able to guarantee that
> all other processes were killed!
>
>
> echo $PATH
>
> /home/mvargas/idl/pro/LibsSDSSS/idlutilsv5_4_15/bin:/usr/local/itt/idl70/bin:/opt/local/bin:/home/mvargas/bin:/home/mvargas/lib:/home/mvargas/lib/openmpi/:/home/mvargas:/home/vargas/bin/:/home/mvargas/idl/pro/LibsSDSSS/idlutilsv5_4_15/bin:/usr/local/itt/idl70/bin:/opt/local/bin:/home/mvargas/bin:/home/mvargas/lib:/home/mvargas/lib/openmpi/:/home/mvargas:/home/vargas/bin/:/usr/lib64/qt3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/envswitcher/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX64:/opt/pvm3/bin/LINUX64:/opt/c3-4/
>
> echo $LD_LIBRARY_PATH
> /usr/local/mpich2/lib:/home/mvargas/lib:/home/mvargas/:/home/mvargas/lib64:/home/mvargas/lib/openmpi/:/usr/lib64/openmpi/1.4-gcc/lib/:/user/local/:/usr/local/mpich2/lib:/home/mvargas/lib:/home/mvargas/:/home/mvargas/lib64:/home/mvargas/lib/openmpi/:/usr/lib64/openmpi/1.4-gcc/lib/:/user/local/:
>
> Version: openmpi-1.6
>
>
> mpirun --bynode --tag-output ompi_info -v ompi full --parsable
> [1,0]:package:Open MPI mvargas@ferrari Distribution
> [1,0]:ompi:version:full:1.6
> [1,0]:ompi:
Re: [OMPI users] MPI_Cart_sub periods
John --

This cartesian stuff always makes my head hurt. :-)

You seem to have hit on a bona-fide bug. I have fixed the issue in our SVN
trunk and will get the fix moved over to the v1.6 and v1.7 branches.

Thanks for the report!

On Aug 29, 2012, at 5:32 AM, Craske, John wrote:

> Hello,
>
> We are partitioning a two-dimensional Cartesian communicator into
> two one-dimensional subgroups. In this situation we have found
> that both one-dimensional communicators inherit the period
> logical of the first dimension of the original two-dimensional
> communicator when using Open MPI. Using MPICH, each
> one-dimensional communicator inherits the period corresponding to
> the dimensions specified in REMAIN_DIMS, as expected. Could this
> be a bug, or are we making a mistake? The relevant calls we make in a
> Fortran code are
>
> CALL MPI_CART_CREATE(MPI_COMM_WORLD, 2, (/ NDIMX, NDIMY /), (/ .True., .False. /), .TRUE., COMM_CART_2D, IERROR)
>
> CALL MPI_CART_SUB(COMM_CART_2D, (/ .True., .False. /), COMM_CART_X, IERROR)
> CALL MPI_CART_SUB(COMM_CART_2D, (/ .False., .True. /), COMM_CART_Y, IERROR)
>
> Following these requests,
>
> CALL MPI_CART_GET(COMM_CART_X, MAXDIM_X, DIMS_X, PERIODS_X, COORDS_X, IERROR)
> CALL MPI_CART_GET(COMM_CART_Y, MAXDIM_Y, DIMS_Y, PERIODS_Y, COORDS_Y, IERROR)
>
> will result in
>
> PERIODS_X = T
> PERIODS_Y = T
>
> If, on the other hand, we define the two-dimensional communicator
> using PERIODS = (/ .False., .True. /), we find
>
> PERIODS_X = F
> PERIODS_Y = F
>
> Your advice on the matter would be greatly appreciated.
>
> Regards,
>
> John.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
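For reference, a minimal C analogue of John's Fortran reproducer is sketched below. It assumes a fixed 2x2 grid run on exactly 4 ranks (the original uses NDIMX x NDIMY) and simply prints the period reported for each sub-communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm cart2d, cart_x, cart_y;
    int dims[2]     = {2, 2};       /* fixed 2x2 grid: run with 4 ranks */
    int periods[2]  = {1, 0};       /* periodic in dimension 0 only     */
    int remain_x[2] = {1, 0}, remain_y[2] = {0, 1};
    int d[1], p[1], c[1], rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart2d);
    if (cart2d == MPI_COMM_NULL) {  /* rank not included in the grid */
        MPI_Finalize();
        return 0;
    }

    /* Keep dimension 0 in one sub-communicator, dimension 1 in the other. */
    MPI_Cart_sub(cart2d, remain_x, &cart_x);
    MPI_Cart_sub(cart2d, remain_y, &cart_y);

    MPI_Cart_get(cart_x, 1, d, p, c);
    if (rank == 0) printf("X sub-communicator periodic: %d (expected 1)\n", p[0]);
    MPI_Cart_get(cart_y, 1, d, p, c);
    if (rank == 0) printf("Y sub-communicator periodic: %d (expected 0)\n", p[0]);

    MPI_Finalize();
    return 0;
}

On a build containing the fix, the X sub-communicator should report periodic = 1 and the Y sub-communicator periodic = 0, matching the MPICH behaviour John describes.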
Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
Hi Siegmar,

Glad to hear that it's working for you.

The warning message appears because the loopback adapter is excluded by
default, but this adapter is actually not installed on Windows. One solution
might be installing the loopback adapter on Windows; it is very easy, only a
few minutes. Or it may be possible to suppress this message inside Open MPI,
but I'm not sure how that could be done.

Regards,
Shiqing

On 2012-09-06 7:48 AM, Siegmar Gross wrote:

Hi Shiqing,

I have solved the problem with the double quotes in OPENMPI_HOME, but there
is still something wrong.

set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"

mpicc init_finalize.c
Cannot open configuration file "c:\Program Files (x86)\openmpi-1.6.1"/share/openmpi\mpicc-wrapper-data.txt
Error parsing data file mpicc: Not found

Everything is OK if you remove the double quotes which Windows automatically adds.

set OPENMPI_HOME=c:\Program Files (x86)\openmpi-1.6.1

mpicc init_finalize.c
Microsoft (R) 32-Bit C/C++-Optimierungscompiler Version 16.00.40219.01 für 80x86
...

mpiexec init_finalize.exe
--------------------------------------------------------------------------
WARNING: An invalid value was given for btl_tcp_if_exclude.
This value will be ignored.

  Local host: hermes
  Value:      127.0.0.1/8
  Message:    Did not find interface matching this subnet
--------------------------------------------------------------------------
Hello!

I get the output from my program, but also a warning from Open MPI. The new
value for the loopback device was introduced a short time ago when I had
problems with the loopback device on Solaris (it used "lo0" instead of your
default "lo"). How can I avoid this message?

The 64-bit version of my program still hangs.

Kind regards

Siegmar


Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
This env is a backup option for the registry.

It solves one problem but there is a new problem now :-((

Without OPENMPI_HOME: Wrong pathname to help files.

D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    invalid if_inexclude
But I couldn't open the help file:
    D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
    No such file or directory.  Sorry!
--------------------------------------------------------------------------
...

With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
the pathname contains the character " in the wrong place, so that it
couldn't find the available help file.

set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"

D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no-hostfile
But I couldn't open the help file:
    "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt:
    Invalid argument.  Sorry!
--------------------------------------------------------------------------
[hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
..\..\openmpi-1.6.1\orte\mca\ras\base\ras_base_allocate.c at line 200
[hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
..\..\openmpi-1.6.1\orte\mca\plm\base\plm_base_launch_support.c at line 99
[hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
..\..\openmpi-1.6.1\orte\mca\plm\process\plm_process_module.c at line 996

It looks like the environment variable can also solve my
problem in the 64-bit environment.

D:\g...\prog\mpi\small_prog>mpicc init_finalize.c

Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 für x64
...

The process hangs without OPENMPI_HOME.

D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
^C

With OPENMPI_HOME:

set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"

D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no-hostfile
But I couldn't open the help file:
    "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt:
    Invalid argument.  Sorry!
--------------------------------------------------------------------------
[hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file
..\..\openmpi-1.6.1\orte\mca\ras\base\ras_base_allocate.c at line 200
[hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file
..\..\openmpi-1.6.1\orte\mca\plm\base\plm_base_launch_support.c at line 99
[hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file
..\..\openmpi-1.6.1\orte\mca\plm\process\plm_process_module.c at line 996

At least the program doesn't block any longer. Do you have any ideas
how this new problem can be solved?

Kind regards

Siegmar


On 2012-09-05 1:02 PM, Siegmar Gross wrote:

Hi Shiqing,

D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
-
[OMPI users] MPI_Allreduce fail (minGW gfortran + OpenMPI 1.6.1)
Dear mpi users and developers,

I am having some trouble with MPI_Allreduce. I am using MinGW (gcc 4.6.2) with OpenMPI 1.6.1. MPI_Allreduce in the C version works fine, but the Fortran version fails with an error. Here is the simple Fortran code to reproduce the error:

program main
    implicit none
    include 'mpif.h'
    character * (MPI_MAX_PROCESSOR_NAME) processor_name
    integer myid, numprocs, namelen, rc, ierr
    integer, allocatable :: mat1(:, :, :)

    call MPI_INIT( ierr )
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

    allocate(mat1(-36:36, -36:36, -36:36))
    mat1(:,:,:) = 111

    print *, "Going to call MPI_Allreduce."
    call MPI_Allreduce(MPI_IN_PLACE, mat1(-36, -36, -36), 389017, MPI_INTEGER, MPI_BOR, MPI_COMM_WORLD, ierr)
    print *, "MPI_Allreduce done!!!"

    call MPI_FINALIZE(rc)
end program

The command that I used to compile:

gfortran Allreduce.f90 -IC:\OpenMPI-win32\include C:\OpenMPI-win32\lib\libmpi_f77.lib

The MPI_Allreduce fails:

[xxx:02112] [[17193,0],0]-[[17193,1],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108).

I am not sure why this happens, but I think it is a problem with the Windows build of Open MPI, since the same simple code works on a Linux system with gfortran.

Any ideas? I appreciate any response!

Yonghui
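For comparison, a C version of the same in-place reduction, which is what Yonghui reports as working, might look roughly like the sketch below. This is an assumption about what that test looks like: the same 73x73x73 integer array and an in-place MPI_BOR reduction.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* Same shape as the Fortran array: 73*73*73 = 389017 integers.
     * MPI_INT here corresponds to MPI_INTEGER in the Fortran code. */
    static int mat1[73 * 73 * 73];
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 73 * 73 * 73; i++)
        mat1[i] = 111;

    printf("Going to call MPI_Allreduce.\n");

    /* In-place bitwise-OR reduction over all ranks: MPI_IN_PLACE
     * replaces the send buffer, and the result lands in mat1. */
    MPI_Allreduce(MPI_IN_PLACE, mat1, 73 * 73 * 73, MPI_INT,
                  MPI_BOR, MPI_COMM_WORLD);

    printf("MPI_Allreduce done!\n");

    MPI_Finalize();
    return 0;
}

If this C variant runs under the same Windows build while the Fortran program fails, that would suggest the problem lies in how the Fortran wrapper library recognizes MPI_IN_PLACE rather than in MPI_Allreduce itself.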
Re: [OMPI users] SIGSEGV in OMPI 1.6.x
Thanks Jeff. I will definitely do the failure analysis. I just wanted to
confirm that this isn't something specific to OMPI itself, e.g., missing some
configuration settings, etc.

On Thu, Sep 6, 2012 at 5:01 AM, Jeff Squyres wrote:
> If you run into a segv in this code, it almost certainly means that you have
> heap corruption somewhere. FWIW, that has *always* been what it meant when
> I've run into segv's in any code under opal/mca/memory/linux/. Meaning: my
> user code did something wrong, it created heap corruption, and then later
> some malloc() or free() caused a segv in this area of the code.
>
> This code is the same ptmalloc memory allocator that has shipped in glibc
> for years. I'd be hard-pressed to say that any code is 100% bug free :-),
> but I'd be surprised if there is a bug in this particular chunk of code.
>
> I'd run your code through valgrind or some other memory-checking debugger
> and see if that can shed any light on what's going on.
>
>
> On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:
>
>> Hi,
>>
>> While debugging a mysterious crash of a code, I was able to trace it down
>> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
>> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>>
>> (gdb) c
>> Continuing.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
>> at malloc.c:4385
>> 4385        nextsize = chunksize(nextchunk);
>> (gdb) l
>> 4380        Consolidate other non-mmapped chunks as they arrive.
>> 4381      */
>> 4382
>> 4383      else if (!chunk_is_mmapped(p)) {
>> 4384        nextchunk = chunk_at_offset(p, size);
>> 4385        nextsize = chunksize(nextchunk);
>> 4386        assert(nextsize > 0);
>> 4387
>> 4388        /* consolidate backward */
>> 4389        if (!prev_inuse(p)) {
>> (gdb) bt
>> #0  opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000) at malloc.c:4385
>> #1  0x2ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637) at malloc.c:3511
>> #2  0x2ae6b18ea736 in opal_memory_linux_free_hook (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
>> #3  0x01412fcc in for_dealloc_allocatable ()
>> #4  0x007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647, name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0) at alloc.F90:1357
>> #5  0x0082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5, na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff, lasto=..., iphorb=..., numd=..., listdptr=..., listd=..., numh=..., listhptr=..., listh=..., nspin=@0xcf4ff0002, dscf=..., eldau=@0x0, deldau=@0x0, fa=..., stress=..., h=..., first=@0x0, last=@0x0) at ldau.F:752
>> #6  0x006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
>> #7  0x0070e257 in M_SIESTA_FORCES::siesta_forces (istep=@0xf9a4d070) at siesta_forces.F:90
>> #8  0x0070e475 in siesta () at siesta.F:23
>> #9  0x0045e47c in main ()
>>
>> Can anybody shed some light here on what could be wrong?
>>
>> Thanks,
>>
>> Yong Qin
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] Open-mx issue with ompi 1.6.1
I built open-mpi 1.6.1 using the open-mx libraries. This worked previously and now I get the following error. Here is my system:

kernel: 2.6.32-279.5.1.el6.x86_64
open-mx: 1.5.2

BTW, open-mx worked previously with open-mpi and the current version works with mpich2.

$ mpiexec -np 8 -machinefile machines cpi
Process 0 on limulus
FatalError: Failed to lookup peer by addr, driver replied Bad file descriptor
cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion `0' failed.
[limulus:04448] *** Process received signal ***
[limulus:04448] Signal: Aborted (6)
[limulus:04448] Signal code:  (-6)
[limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
[limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
[limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
[limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
[limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0) [0x332462bae0]
[limulus:04448] [ 5] /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197) [0x7fb587418b37]
[limulus:04448] [ 6] /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55) [0x7fb58741a5d5]
[limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a) [0x7fb587419c7a]
[limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c) [0x7fb58741a27c]
[limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15) [0x7fb587425865]
[limulus:04448] [10] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e) [0x7fb5876fe40e]
[limulus:04448] [11] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4) [0x7fb5876fbd94]
[limulus:04448] [12] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb) [0x7fb58777d6fb]
[limulus:04448] [13] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb) [0x7fb58777509b]
[limulus:04448] [14] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b) [0x7fb58770b55b]
[limulus:04448] [15] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8) [0x7fb58770b8b8]
[limulus:04448] [16] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc) [0x7fb587702d8c]
[limulus:04448] [17] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78) [0x7fb587712e88]
[limulus:04448] [18] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(MPI_Bcast+0x130) [0x7fb5876ce1b0]
[limulus:04448] [19] cpi(main+0x10b) [0x400cc4]
[limulus:04448] [20] /lib64/libc.so.6(__libc_start_main+0xfd) [0x332461ecdd]
[limulus:04448] [21] cpi() [0x400ac9]
[limulus:04448] *** End of error message ***
Process 2 on limulus
Process 4 on limulus
Process 6 on limulus
Process 1 on n0
Process 7 on n0
Process 3 on n0
Process 5 on n0
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 4448 on node limulus exited on
signal 6 (Aborted).
--------------------------------------------------------------------------

--
Doug
Re: [OMPI users] [gridengine users] h_vmem in jobs with mixture of openmpi and openmp
On 06.09.2012, at 13:21, Schmidt U. wrote:

>>
> If h_vmem is defined in the script, what sense does an additional vf option
> in the script then make? By default h_vmem has a higher value than vf, so it
> must fit first to let the job run.

If you want to avoid swapping, both should have the same value anyway.

>>
>>> pe:
>>> pe_name            openmp_6
>>> slots              3168
>>> user_lists         standard
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    6
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>>
>>> job script:
>>> #$ -N test
>>> #$ -cwd
>>> #$ -o $JOB_ID.out
>>> #$ -e $JOB_ID.err
>>> #$ -l h_rt=150:00:00
>>> #$ -l vf=2.3G
>>> #$ -l h_vmem=3G
>>> #$ -pe openmp_6 72
>>> export OMP_NUM_THREADS=6
>>> export MKL_NUM_THREADS=($OMP_NUM_THREADS)
>>> mpirun --mca btl openib,self -pernode -np 12 /my_mixed_job
>>
>> With this you will get 12 machines, and on each you can use 6 threads. As
>> all (threads on a machine) will work on the same memory, this shouldn't be
>> a problem. But you are using MKL with $OMP_NUM_THREADS too, which could
>> locally create 36 processes as a result. Therefore I usually use unthreaded
>> versions of MKL/ACML/ATLAS.
>
> Thanks for that hint. As a workaround, could I check all scripts for
> MKL_NUM_THREADS and set it to 1 via JSV?

Yes. But you could also try the opposite, i.e. OMP_NUM_THREADS=1 and
MKL_NUM_THREADS=$NSLOTS. It depends on the application which is better.

-- Reuti

> Udo
>>
>> -- Reuti