Re: [OMPI users] Valgrind Functionality
Thanks Ashley. After going through your suggestions we tried our test with valgrind 3.3.0 and with glibc-devel-2.5-18.el5_1.1; both exhibit the same results. A simple non-MPI test program, however, returns the expected reports, so valgrind itself looks ok.

We then checked that the same (shared) libc gets linked in both the MPI and non-MPI cases, and adding -pthread to the cc command line yields the same result; the only apparent difference is the Open MPI libraries. mpicc links against libopen-pal, which defines malloc for its own purposes, so the big difference seems to be that libopen-pal.so is providing its own malloc replacement. Is there perhaps something I have missed at runtime? I have included the ompi_info output just in case. Thanks in advance. Tom

--- On Tue, 7/8/08, Ashley Pittman wrote:
From: Ashley Pittman
Subject: Re: [OMPI users] Valgrind Functionality
To: rarebit...@yahoo.com, "Open MPI Users"
Date: Tuesday, July 8, 2008, 2:05 AM

On Mon, 2008-07-07 at 19:09 -0700, Tom Riddle wrote:
> I was attempting to get valgrind working with a simple MPI app
> (osu_latency) on OpenMPI. While it appears to report uninitialized
> values it fails to report any mallocs or frees that have been
> conducted.

The normal reason for this is either using static applications or having a very stripped glibc. It doesn't appear you've done the former as you are linking in libpthread, but the latter is a possibility; you might benefit from installing the glibc-devel package. I don't recall RHEL being among the worst offenders at stripping libc, however.

> I am using RHEL 5, gcc 4.2.3 and a drop from the repo labeled
> openmpi-1.3a1r18303, configured with
>
> $ ../configure --prefix=/opt/wkspace/openmpi-1.3a1r18303 CC=gcc
> CXX=g++ --disable-mpi-f77 --enable-debug --enable-memchecker
> --with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/
>
> As the FAQs suggest I am running a later version of valgrind,
> enabling the memchecker and debug.
> I tested a slightly modified osu_latency test which has a simple char
> buffer malloc and free, but the valgrind summary shows no malloc/free
> activity whatsoever. This is running on a dual node system using
> Infinipath HCAs. Here is a trimmed output.

Although you configured Open MPI with what appears to be valgrind 3.3.0, the version of valgrind you are actually running is 3.2.1; perhaps you want to specify the full path of valgrind on the mpirun command line?

> [tom@lab01 ~]$ mpirun --mca pml cm -np 2 --hostfile my_hostfile valgrind ./osu_latency1
> ==17839== Memcheck, a memory error detector.
> ==17839== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
> ==17839== Using LibVEX rev 1658, a library for dynamic binary translation.
> ==17839== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==17839== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
> ==17839== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
> ==17839== For more details, rerun with: -v

Ashley Pittman.

Attachment: ompi_info.out.bz2 (binary data)
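Two quick command-line checks may help narrow this down: confirming whether libopen-pal really exports its own malloc, and (per Ashley's suggestion) invoking the 3.3.0 valgrind by full path so the nodes don't pick up 3.2.1. A sketch, with paths assumed from the configure line quoted above:

```shell
# Does libopen-pal interpose on the allocator?  (The install prefix is
# an assumption taken from the configure line quoted above.)
nm -D /opt/wkspace/openmpi-1.3a1r18303/lib/libopen-pal.so | grep -wE 'malloc|free'

# Run the intended valgrind 3.3.0 explicitly by full path, rather than
# whatever "valgrind" resolves to on each node.
mpirun --mca pml cm -np 2 --hostfile my_hostfile \
    /opt/wkspace/valgrind-3.3.0/bin/valgrind --leak-check=full ./osu_latency1
```

If the first command prints malloc/free symbols, allocations are being serviced inside libopen-pal rather than glibc, which would explain Memcheck seeing no heap activity; rebuilding Open MPI with the --without-memory-manager configure option (if your version supports it) is one way to rule this out.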
Re: [OMPI users] Gridengine + Open MPI
Pak Lui wrote:
Romaric David wrote:
Pak Lui wrote:

It was fixed at one point in the trunk before v1.3 went official, but while rolling the gridengine PLM code into the rsh PLM, this feature was left out because there were some lingering issues that I didn't resolve, and I lost track of it. Sorry, but thanks for bringing it up. I will need to look at the issue again and reopen this ticket against v1.3:

Ok, so do I have to wait for a 1.3 version to get working job suspend, or will it be back-ported to the 1.2 series?

I believe it will definitely be in the 1.3 series; I am not sure about v1.2 at this point.

Even though it is the rsh PLM that starts the parallel job under SGE, the rsh PLM can detect whether the Open MPI job is started under the SGE Parallel Environment (by checking some SGE environment variables) and use the "qrsh -inherit" command to launch the parallel job the same way as before. You can check by adding something like "--mca plm_base_verbose 10" to your mpirun command and looking for the launch commands that mpirun uses.

It looks like the shepherd cannot be started, for a reason I haven't figured out yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745]

Romaric, I just made a fix for the problem I've shown below in r18844. I think it is essentially the same problem that you are running into here. Please let me know if you still see the SGE tight-integration job erroring out. I'll look at the suspend/resume feature later on.

Regards, Romaric

How recent is the build that you used to generate the error above? I assume you are using a trunk build? I didn't see the complete error messages that you are seeing, but I think I am running into the same exact error too. It seems to be a weird error pointing out that the 'ssh' component was not found. I don't believe there's a component named 'ssh' here, because ssh and rsh share the same component.
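To see which launcher the rsh PLM actually uses under SGE, the verbose flag mentioned above can go in the job script. A hypothetical sketch (the PE name "orte", slot count, and application name are assumptions; use your site's PE):

```shell
#!/bin/sh
# Hypothetical SGE job script; adjust the PE name and slot count to
# whatever your site defines.
#$ -pe orte 4
#$ -cwd
# plm_base_verbose 10 makes mpirun print the launch commands it uses;
# under a tight-integration PE the output should show
# "qrsh -inherit ... orted ..." rather than plain ssh/rsh.
mpirun --mca plm_base_verbose 10 -np $NSLOTS ./my_mpi_app
```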
Well, it looks like something is broken in the PLM that is responsible for launching the tight-integration job for SGE. I checked that it used to work without problem with my earlier trunk build (r18645). I have to find out what has happened since...

Starting server daemon at host "burl-ct-v440-4"
Server daemon successfully started with task id "1.burl-ct-v440-4"
Establishing /opt/sge/utilbin/sol-sparc64/rsh session to host burl-ct-v440-4 ...
[burl-ct-v440-4:13749] mca: base: components_open: Looking for plm components
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: burl-ct-v440-4
Framework: plm
Component: ssh
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 70
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_plm_base_open failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 135
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 132
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_ess_set_name failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 311
/opt/sge/utilbin/sol-sparc64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
A daemon (pid 9
Re: [OMPI users] ORTE_ERROR_LOG timeout
Several things are going on here. First, this error message:

> mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal
> 6 (Aborted).
> 2 additional processes aborted (not shown)

indicates that your application procs are aborting for some reason. The system then attempts to shut down and somehow got itself "hung", hence the timeout error message. I'm not sure that increasing the timeout value will help in this situation. Unfortunately, 1.2.x has problems with this scenario (1.3 is -much- better! ;-)).

If you want to try adjusting the timeout anyway, you can do so with:

mpirun -mca orte_abort_timeout x ...

where x is the specified timeout in seconds. Hope that helps. Ralph

On 7/8/08 8:55 AM, "Alastair Basden" wrote:
> Hi,
> I've got some code that uses openmpi, and sometimes it crashes after
> printing something like:
>
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal 6 (Aborted).
> 2 additional processes aborted (not shown)
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job. Returned
> value Timeout instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
> In this case, all processes were running on the same machine, so it's not a
> connection problem. Is this a bug, or is something else wrong? Is there a
> way to increase the timeout time?
>
> Thanks...
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
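As with other MCA parameters, the timeout can also be set through the environment rather than on the command line; a sketch with an illustrative 60-second value (the application name is a placeholder):

```shell
# On the command line, as described above:
mpirun -mca orte_abort_timeout 60 -np 4 ./my_app

# or equivalently via Open MPI's OMPI_MCA_ environment-variable prefix:
export OMPI_MCA_orte_abort_timeout=60
mpirun -np 4 ./my_app
```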
[OMPI users] ORTE_ERROR_LOG timeout
Hi,
I've got some code that uses openmpi, and sometimes it crashes after printing something like:

[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal 6 (Aborted).
2 additional processes aborted (not shown)
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned
value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

In this case, all processes were running on the same machine, so it's not a connection problem. Is this a bug, or is something else wrong? Is there a way to increase the timeout time?

Thanks...
Re: [OMPI users] Valgrind Functionality
On Mon, 2008-07-07 at 19:09 -0700, Tom Riddle wrote:
> I was attempting to get valgrind working with a simple MPI app
> (osu_latency) on OpenMPI. While it appears to report uninitialized
> values it fails to report any mallocs or frees that have been
> conducted.

The normal reason for this is either using static applications or having a very stripped glibc. It doesn't appear you've done the former as you are linking in libpthread, but the latter is a possibility; you might benefit from installing the glibc-devel package. I don't recall RHEL being among the worst offenders at stripping libc, however.

> I am using RHEL 5, gcc 4.2.3 and a drop from the repo labeled
> openmpi-1.3a1r18303, configured with
>
> $ ../configure --prefix=/opt/wkspace/openmpi-1.3a1r18303 CC=gcc
> CXX=g++ --disable-mpi-f77 --enable-debug --enable-memchecker
> --with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/
>
> As the FAQs suggest I am running a later version of valgrind,
> enabling the memchecker and debug.
>
> I tested a slightly modified osu_latency test which has a simple char
> buffer malloc and free, but the valgrind summary shows no malloc/free
> activity whatsoever. This is running on a dual node system using
> Infinipath HCAs. Here is a trimmed output.

Although you configured Open MPI with what appears to be valgrind 3.3.0, the version of valgrind you are actually running is 3.2.1; perhaps you want to specify the full path of valgrind on the mpirun command line?

> [tom@lab01 ~]$ mpirun --mca pml cm -np 2 --hostfile my_hostfile valgrind ./osu_latency1
> ==17839== Memcheck, a memory error detector.
> ==17839== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
> ==17839== Using LibVEX rev 1658, a library for dynamic binary translation.
> ==17839== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==17839== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
> ==17839== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
> ==17839== For more details, rerun with: -v

Ashley Pittman.